Abstract

Given data drawn from a mixture of multivariate Gaussians, a basic problem is to accurately estimate the
mixture parameters. We give an algorithm for this problem whose running time and data requirement are
polynomial in the dimension and the inverse of the desired accuracy, with provably minimal assumptions on
the Gaussians. As simple consequences of our learning algorithm, we can efficiently perform near-optimal
clustering of the sample points and density estimation for mixtures of k Gaussians.

The building blocks of our algorithm are based on the work of Kalai et al. (STOC 2010) [17], which gives
an efficient algorithm for learning mixtures of two Gaussians by considering a series of projections down
to one dimension, and applying the method of moments to each univariate projection. A major technical
hurdle in [17] is showing that one can efficiently learn univariate mixtures of two Gaussians. In contrast,
because pathological scenarios can arise when considering univariate projections of mixtures of more than two
Gaussians, the bulk of the work in this paper concerns how to leverage an algorithm for learning univariate
mixtures (of many Gaussians) to yield an efficient algorithm for learning in high dimensions. Our algorithm
employs hierarchical clustering and rescaling, together with delicate methods for backtracking and recovering
from failures that can occur in our univariate algorithm.

Finally, while the running time and data requirements of our algorithm depend exponentially on the 
number of Gaussians in the mixture, we prove that such a dependence is necessary. 



1 Introduction 



Given access to random samples generated from a mixture of (multivariate) Gaussians, the algorithmic
problem of learning the parameters of the underlying distribution is of fundamental importance in physics,
biology, geology, and the social sciences: any area in which such finite mixture models arise [21], [31]. Starting with
Dasgupta [8], a series of works in theoretical computer science has sought to find (or disprove the existence of)
an efficient algorithm for this task [?], [10], [33], [1], [?], [3]. In this paper, we settle the polynomial-time learnability
of mixtures of Gaussians, giving an algorithm that uses a polynomial amount of data and estimates the
components at an inverse polynomial rate under provably minimal assumptions on the mixture (specifically,
that the mixing weights and the statistical distance between the components are bounded away from zero).
As a corollary, our efficient learning algorithm can be employed to yield the first provably efficient algorithm
for near-optimal clustering and density estimation, without any restrictions on the Gaussian mixture. Finally,
we note that the runtime and data requirements of our algorithm are exponential in the number of Gaussian
components; however, as we show in Section 6, this exponential dependence is necessary. In the remainder of
this section, we briefly summarize previous work on this problem, formally state our main result, and then
discuss the differences between learning mixtures of two Gaussians and mixtures of many Gaussians, which
motivate the high-level outline of our algorithm presented in Section 2. We first define a Gaussian Mixture
Model (GMM).

Consider a set of $k$ different multinormal distributions, with each distribution being defined by a mean
$\mu_i \in \mathbb{R}^n$ and covariance matrix $\Sigma_i \in \mathbb{R}^{n \times n}$. Given a vector of $k$ nonnegative weights, $w$, summing to one, we
define the associated Gaussian Mixture Model (GMM) to be the distribution yielded by, for each $i = 1, \dots, k$,
taking a sample from $\mathcal{N}(\mu_i, \Sigma_i)$ with probability $w_i$. Letting $F_i$ denote the multinormal density function of
the $i$-th component, $\mathcal{N}(\mu_i, \Sigma_i)$, the density function of the mixture is $F = \sum_i w_i F_i$.
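The sampling process in this definition can be sketched in a few lines of Python. This is only an illustration: the function names `cholesky` and `sample_gmm`, the example parameters, and the use of a Cholesky factor to draw from each component are assumptions made here, not part of the paper.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite matrix a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][t] * L[j][t] for t in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def sample_gmm(weights, means, covs, m, rng):
    """Draw m points from the GMM; returns (points, labels), labels being the hidden components."""
    factors = [cholesky(c) for c in covs]
    n = len(means[0])
    points, labels = [], []
    for _ in range(m):
        # Pick component i with probability w_i, then sample from N(mu_i, Sigma_i).
        i = rng.choices(range(len(weights)), weights=weights)[0]
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        x = [means[i][r] + sum(factors[i][r][c] * z[c] for c in range(r + 1))
             for r in range(n)]
        points.append(x)
        labels.append(i)
    return points, labels

# Hypothetical two-component mixture in two dimensions.
rng = random.Random(0)
weights = [0.3, 0.7]
means = [[0.0, 0.0], [10.0, 0.0]]
covs = [[[1.0, 0.5], [0.5, 1.0]], [[2.0, 0.0], [0.0, 0.5]]]
points, labels = sample_gmm(weights, means, covs, 20000, rng)
```

The learning problem studied here receives only `points`; the `labels` are hidden and are returned in this sketch purely for inspection.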

1.1 A Brief History 

The most popular solution for recovering reasonable estimates of the components of GMMs in practice is
the EM algorithm given by Dempster, Laird and Rubin [11]. This algorithm is a local-search heuristic that
converges to a set of parameters that locally maximizes the probability of generating the observed samples.
However, the EM algorithm is a heuristic only, and makes no guarantees about converging to an estimate
that is close to the true parameters. Worse still, the EM algorithm (even for univariate mixtures of just two
Gaussians) has been observed to converge very slowly (see Redner and Walker for a thorough treatment [?7]).

In order to even hope for an algorithm (not necessarily even polynomial time), we would need a uniqueness
property: that two distinct mixtures of Gaussians must have different probability density functions. Teicher
[30] demonstrated that a mixture of Gaussians can be uniquely identified (up to a relabeling of the components) by
considering the probability density function at points sufficiently far from the centers (in the tails). However,
such a result sheds little light on the rate of convergence of an estimator: if distinguishing Gaussian mixtures
really required analyzing the tails of the distribution, then we would require an enormous number of data
samples!

Dasgupta [8] introduced theoretical computer science to the algorithmic problem of provably recovering
good estimates for the parameters in polynomial time (and with a polynomial number of samples). His technique is
based on projecting the data down to a randomly chosen low-dimensional subspace and finding an accurate clustering there.
Given enough accurately clustered points, the empirical means and covariances of these points will be a good
estimate for the actual parameters. Arora and Kannan extended these ideas to work in the much more
general setting in which the covariances of each Gaussian component could be arbitrary, and not necessarily
almost spherical as in [8]. Yet both of these techniques are based on the concentration of distances (under
random projections), and consequently required that the centers of the components be separated by at least
$\sqrt{n}$ times the largest variance. Vempala and Wang [33] and Achlioptas and McSherry [1] introduced the
use of spectral techniques, and were able to overcome this barrier (of relying on distance concentration) by
choosing a subspace on which to project based on large principal components. Brubaker and Vempala [4]
later gave the first affine-invariant algorithm for learning mixtures of Gaussians, and these ideas proved to
be central in subsequent work [17].

Yet all of these approaches for provably learning good estimates require, at the very least, that the
statistical overlap (i.e. one minus the statistical distance) between each pair of components be
smaller than some constant (in some cases, it is even required that the statistical overlap be exponentially
small). Recently, Feldman et al. [13] gave a polynomial time algorithm for the related problem of density
estimation (without any separation condition) for the special case of axis-aligned GMMs (GMMs where
each component has principal coordinates aligned with the coordinate axes). Also without any separation
requirements, Belkin and Sinha [5] showed that one can efficiently learn GMMs in the special case that all
components are identical spherical Gaussians. Most similar to the present work is the recent work of Kalai
et al. [17], which gave a learning algorithm for the case of mixtures of two arbitrary Gaussians with provably
minimal assumptions.



1.2 Main Results 

In this section we state our main results. To motivate these results, we first state three obvious lower bounds
for recovering the parameters of a GMM $F = \sum_i w_i F_i$, which motivate our definition of $\epsilon$-statistically
learnable. We provide a formal definition of statistical distance in Section 2.1.



1. Permuting the order of the components does not change the resulting density, thus at best the hope is
to recover the parameter set, $\{(w_1, \mu_1, \Sigma_1), \dots, (w_k, \mu_k, \Sigma_k)\}$.

2. We require at least $\Omega(1/\min_i w_i)$ samples to estimate the parameters, since we require this number of
samples to ensure that we have seen, with reasonable probability, a sample from each component.

3. If $F_i = F_j$, then it is impossible to accurately estimate $w_i$, and in general we require at least $\Omega(1/D(F_i, F_j))$
samples to estimate $w_i$, where $D(F_i, F_j)$ denotes the statistical distance between the two distributions.

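Definition 1. We call a GMM $F = \sum_i w_i F_i$ $\epsilon$-statistically learnable if $\min_i w_i > \epsilon$ and $\min_{i \ne j} D(F_i, F_j) > \epsilon$.

We now consider what it means to "accurately recover the mixture components".

Definition 2. Given two $n$-dimensional GMMs of $k$ Gaussians, $F = \sum_i w_i \mathcal{N}(\mu_i, \Sigma_i)$ and $\hat{F} = \sum_i \hat{w}_i \mathcal{N}(\hat{\mu}_i, \hat{\Sigma}_i)$,
we call $\hat{F}$ an $\epsilon$-close estimate for $F$ if there is a permutation function $\pi : [k] \to [k]$ such that for all $i \in [k]$

1. $|w_i - \hat{w}_{\pi(i)}| < \epsilon$

2. $D(\mathcal{N}(\mu_i, \Sigma_i), \mathcal{N}(\hat{\mu}_{\pi(i)}, \hat{\Sigma}_{\pi(i)})) < \epsilon$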

Note that the above definition of an $\epsilon$-close estimate is affine invariant. This is more natural than defining a
good estimate in terms of additive errors, since in general, even estimating the mean of an arbitrary Gaussian
to some fixed additive precision is impossible without restrictions on the covariance, as scaling the data will
scale the error linearly. We can now state our main theorem:

Theorem 1. Given any $n$-dimensional mixture of $k$ Gaussians $F$ that is $\epsilon$-statistically learnable, we can
output an $\epsilon$-close estimate $\hat{F}$, and the running time and data requirements of our algorithm (for any fixed $k$)
are polynomial in $n$ and $1/\epsilon$.

The guarantee in the main theorem implies that the estimated parameters are off by an additive $O(\epsilon \sigma^2_{\max})$,
where $\sigma^2_{\max}$ is the largest (projected) variance of any Gaussian in any direction.

Throughout this paper, we favor clarity of proof and exposition above optimization of runtime. Since our
main goal is to show that these problems can be solved in polynomial time, we make very little effort to optimize
the exponent. Our algorithms are polynomial in the dimension, inverse of the failure probability, and inverse
of the target accuracy for any fixed number of Gaussians, $k$. The dependency on $k$, however, is severe: the
degree of our polynomials is linear in $k$. In Section 6, we give a natural construction of two GMMs $F, F'$ of
$k$ components that are each $1/k$-statistically learnable and whose statistical distance $D(F, F')$ is exponentially
small in $k$, but $F$ is not even a $1/4$-close estimate of $F'$. Thus we require a number of samples exponential in $k$
to even distinguish these two mixtures,
demonstrating that the exponential dependency on $k$ in our learning algorithms is inevitable.

Proposition. There exist two GMMs $F, F'$ of $k$ components each that satisfy the following properties:

• D{F,F') < 0(e-'=/3"). 

• $F, F'$ are $1/k$-statistically learnable.

• $F$ is not a $1/4$-close estimate of $F'$.

1.3 Applications

We can leverage our main theorem to show that we can efficiently perform density estimation for arbitrary
GMMs. For density estimation (as opposed to parameter recovery) we only care to recover a distribution
that is similar to the GMM, without worrying about matching each component. In particular, if the true
weight of one of the components is negligible, we can simply disregard that component with negligible effect
on the statistical distance; if two components are nearly identical in statistical distance, we can simply regard
them as being merged into one component. For these reasons, we can perform density estimation efficiently
without the restriction to $\epsilon$-statistically learnable distributions that was required for Theorem 1.

Corollary 2. For any $n \ge 1$, $\epsilon, \delta > 0$, and any $n$-dimensional GMM $F = \sum_{i=1}^{k} w_i F_i$, given access to
independent samples from $F$, there is an algorithm that outputs $\hat{F} = \sum_{i=1}^{k} \hat{w}_i \hat{F}_i$ such that with probability at
least $1 - \delta$ over the randomization in the algorithm and in selecting the samples, $D(F, \hat{F}) < \epsilon$. Additionally,
the runtime and number of samples are bounded by $\mathrm{poly}(n, 1/\epsilon, 1/\delta)$.

The proof of this corollary follows immediately from combining our main theorem with the arguments
in Appendix D. In fact, an approach almost identical to how we construct the General Univariate Algorithm
from the Basic Univariate Algorithm (again in Appendix D) will work, because we can run
our main algorithm with many different parameter ranges so that most estimates are correct, and determine
a consensus among the estimates so that we can recover a good statistical approximation to $F$ without any
assumptions on the mixture, not even $\epsilon$-statistical learnability.

The second corollary that we obtain from Theorem 1 is for clustering. To define the problem of clustering,
suppose that during the data sampling process, for each point $x_i \in \mathbb{R}^n$, a hidden label $y_i \in \{1, \dots, k\}$, called
the ground truth, is generated based upon which Gaussian was used for sampling. A clustering algorithm
takes as input $m$ points and outputs a classifier $C : \mathbb{R}^n \to \{1, \dots, k\}$. The error of a classifier is the minimum,
over all label permutations, of the probability that the permuted label disagrees with the ground truth. Given
the mixture parameters, it is easy to see that the optimal clustering algorithm will simply assign labels based
on the Gaussian component with largest posterior probability.
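As a small illustration of these two notions, the sketch below builds the maximum-posterior classifier for a known mixture and computes the clustering error minimized over label permutations. It is restricted to univariate mixtures to stay short, and the names `map_classifier` and `clustering_error` as well as the example parameters are assumptions made for this sketch.

```python
import itertools
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def map_classifier(weights, mus, variances):
    """Return C(x): the index of the component with the largest posterior at x."""
    def classify(x):
        # The posterior of component i is proportional to w_i * F_i(x).
        scores = [w * gaussian_pdf(x, mu, v)
                  for w, mu, v in zip(weights, mus, variances)]
        return max(range(len(scores)), key=scores.__getitem__)
    return classify

def clustering_error(predicted, truth, k):
    """Fraction of disagreements, minimized over relabelings of the predicted labels."""
    best = 1.0
    for perm in itertools.permutations(range(k)):
        wrong = sum(perm[p] != t for p, t in zip(predicted, truth))
        best = min(best, wrong / len(truth))
    return best

# Hypothetical balanced mixture of N(-2, 1) and N(2, 1).
classify = map_classifier([0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
labels = [classify(x) for x in (-3.0, -1.0, 1.0, 3.0)]
```

Enumerating all permutations costs $k!$ evaluations, which is acceptable only because $k$ is treated as a fixed constant throughout.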

Corollary 3. For any $n \ge 1$, $\epsilon, \delta > 0$, and any $n$-dimensional $\epsilon$-statistically learnable GMM $F = \sum_{i=1}^{k} w_i F_i$,
given access to independent samples from $F$, there is an algorithm that outputs a classifier $C_{\hat{F}}$ such that with
probability at least $1 - \delta$ over the randomization in the algorithm and in selecting the samples, the error of
$C_{\hat{F}}$ is at most $\epsilon$ larger than the error of any classifier $C$. Additionally, the runtime and number of samples
used are bounded by $\mathrm{poly}(n, 1/\epsilon, 1/\delta)$.

The proof of this corollary follows immediately from our main theorem (here, however, we do need the assumption
of $\epsilon$-statistical learnability).

1.4 Comparing Learning Two Gaussians to Learning Many 

This work leverages several key ideas initially presented in [17], which were used to show that learning mixtures
of two arbitrary Gaussians can be done efficiently. Nevertheless, additional high-level insights and technical
details were required to extend the previous work to give an efficient learning algorithm for an arbitrary
mixture of many Gaussians. In this section we briefly summarize the algorithm for learning mixtures of two
Gaussians given in [17], and then describe the hurdles to extending it to the general case. This discussion
will provide insights and motivate the high-level structure of the algorithm presented in this paper, as well as
clarify which components of the proof are new, and which are straightforward adaptations of ideas from [17].

Throughout this discussion, it will be helpful to refer to parameters $\epsilon_1, \epsilon_2, \epsilon_3$, which are polynomially
related to each other and satisfy $\epsilon_1 \ll \epsilon_2 \ll \epsilon_3$.

There are three key components to the proof that mixtures of two Gaussians can be learned efficiently:
the 1-d Learnability Lemma, the Random Projection Lemma, and the Parameter Recovery Lemma. The
1-d Learnability Lemma states that given a mixture of two univariate Gaussians whose two components
have nonnegligible statistical distance, one can efficiently recover accurate estimates of the parameters of
the mixture. It is worth noting that in the univariate case, saying that the statistical distance between
two Gaussians is nonnegligible is roughly equivalent (polynomially related) to saying that the two sets of
parameters are nonnegligibly different, i.e. the parameter distance, $|\mu - \mu'| + |\sigma^2 - \sigma'^2|$, is nonnegligible.
The Random Projection Lemma states that, given an $n$-dimensional mixture of two Gaussians which is in
isotropic position and whose components have nonnegligible statistical distance, with high probability over
the choice of a random unit vector $r$, the projection of the mixture onto $r$ will yield a univariate mixture
of two Gaussians that have nonnegligible statistical distance (say $\epsilon_3$). The final component, the Parameter
Recovery Lemma, states that, given a Gaussian $G$ in $n$ dimensions, if one has extremely accurate estimates
(say to within some $\epsilon_1$) of the mean and variance of $G$ projected onto $n^2$ sufficiently distinct directions
(directions that differ by at least $\epsilon_2 \gg \epsilon_1$), one can accurately recover the parameters of $G$.
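To see why projections determine an $n$-dimensional Gaussian, note that the projected mean $r^T \mu$ is linear in $\mu$ and the projected variance $r^T \Sigma r$ is linear in the entries of $\Sigma$. The sketch below recovers $\mu$ and $\Sigma$ from exact projected parameters by least squares. It is a noise-free illustration only: the function names, the normal-equations solver, and the choice of directions are assumptions made here, and the lemma's actual error analysis is not reproduced.

```python
import math

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for t in range(c, n + 1):
                m[r][t] -= f * m[c][t]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][t] * x[t] for t in range(r + 1, n))) / m[r][r]
    return x

def least_squares(a, b):
    """Least-squares solution of a x ~ b via the normal equations."""
    cols = len(a[0])
    ata = [[sum(row[i] * row[j] for row in a) for j in range(cols)] for i in range(cols)]
    atb = [sum(row[i] * bi for row, bi in zip(a, b)) for i in range(cols)]
    return solve(ata, atb)

def recover_gaussian(directions, proj_means, proj_vars):
    """Recover (mu, Sigma) from the means r.mu and variances r^T Sigma r along each direction r."""
    n = len(directions[0])
    mu = least_squares(directions, proj_means)
    # r^T Sigma r = sum_i r_i^2 Sigma_ii + 2 sum_{i<j} r_i r_j Sigma_ij is linear in Sigma.
    pairs = [(i, j) for i in range(n) for j in range(i, n)]
    design = [[(1 if i == j else 2) * r[i] * r[j] for (i, j) in pairs] for r in directions]
    entries = least_squares(design, proj_vars)
    sigma = [[0.0] * n for _ in range(n)]
    for (i, j), val in zip(pairs, entries):
        sigma[i][j] = sigma[j][i] = val
    return mu, sigma

# Hypothetical 2-d Gaussian observed along n^2 = 4 distinct unit directions.
mu_true = [1.0, -2.0]
sigma_true = [[2.0, 0.5], [0.5, 1.0]]
dirs = [[math.cos(t), math.sin(t)] for t in (0.0, 0.3, 0.6, 0.9)]
proj_means = [r[0] * mu_true[0] + r[1] * mu_true[1] for r in dirs]
proj_vars = [sum(r[i] * sigma_true[i][j] * r[j] for i in range(2) for j in range(2))
             for r in dirs]
mu_hat, sigma_hat = recover_gaussian(dirs, proj_means, proj_vars)
```

When the directions are very close to one another the linear system becomes ill-conditioned, which is one way to see why the univariate estimates must be far more accurate ($\epsilon_1$) than the spacing between directions ($\epsilon_2$).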

Given these three pieces, the high-level algorithm for learning mixtures of two Gaussians is straightforward:

1. Pick a random unit vector $r$.

2. Pick $n^2$ vectors $r_1, \dots, r_{n^2}$ that are "close" to $r$, say $\|r_i - r\| \approx \epsilon_2$.

3. For each $i = 1, \dots, n^2$, learn extremely accurate (to accuracy $\epsilon_1 \ll \epsilon_2$) univariate parameters $w_i, \mu_i, \sigma_i, \mu_i', \sigma_i'$
for the projection of the mixture onto the vector $r_i$.

4. Since $\|r_i - r_j\|$ is at most on the order of $\epsilon_2$, it is not hard to show that with high probability, $|\mu_i - \mu_j| \ll \epsilon_3$ and $|\sigma_i - \sigma_j| \ll \epsilon_3$,
and by the Random Projection Lemma, $\|(\mu_i, \sigma_i) - (\mu_i', \sigma_i')\| \gg \epsilon_3$. Thus it will be easy to accurately
match up which parameters come from which component in the different projections, and we can apply
the Parameter Recovery Lemma to each of the two components.

Some of the above ideas are immediately applicable to the problem of learning mixtures of many Gaussians:
we can clearly use the Parameter Recovery Lemma without modification. Additionally, we prove a
generalization of the 1-d Learnability Lemma for mixtures of arbitrary numbers of Gaussians, provided each
component has nonnegligible statistical distance (which, while technically tedious, employs the key idea
from [17] of "deconvolving" by a suitably chosen Gaussian; see Appendix B). Given this extension, if we
were given a mixture of $k$ Gaussians in isotropic position, and were guaranteed that the projection onto some
vector $r$ resulted in a univariate mixture of Gaussians for which all pairs of components either had reasonably
different means or reasonably different variances, then we could piece together the parts more or less as in
the 2-Gaussians case.

Unfortunately, however, the Random Projection Lemma ceases to hold in the general setting. There
exist mixtures of just three Gaussians with significant pairwise statistical distances that are in isotropic
position, but have the property that with extremely high probability over choices of a random unit vector $r$,
the projection of the mixture onto $r$ yields a distribution that is extremely close to a univariate mixture of two
Gaussians. This observation would foil the approach employed in the case of just two Gaussians! Another
difficulty is that if we take $n^2$ slightly different projections of our mixture of $k$ Gaussians, then it is possible
that in some of the projections we see what looks like a mixture of $k' < k$ univariate Gaussians, and in some
other projections we see what looks like a mixture of $k''$ univariate Gaussians. How do we match up estimates
from projections onto different directions when the number of Gaussians in the estimate can differ? Or what
if each projection results in an estimate that is a mixture of $k' < k$ Gaussians? Then how can we recover an
$n$-dimensional estimate that is a mixture of $k$ Gaussians?

2 Outline and Definitions 

We now discuss the high-level structure of our learning algorithm, building from the intuition given in the
preceding section. At the highest level, our learning algorithm has the following form:
Given access to samples from a mixture of $k$ Gaussians,

1. Learn the parameters of some mixture of $k' \le k$ Gaussians, where each learned Gaussian component
roughly corresponds to one or more of the Gaussians in the original mixture.

2. If $k' < k$, for each of the $k'$ components recovered in the previous step, examine it closely and figure
out whether it corresponds to a single Gaussian component of the original mixture, or whether it is a
mixture of several of the original components (in which case we will then need to learn the parameters
of these sub-components).

To accomplish the first step, we will require accurate parameters of the projection of each of the $k'$
"clusters" of components onto univariate projections. To do this, we employ a robust univariate algorithm
which, given access to samples from a univariate GMM, essentially searches for some target resolution window
$(w_1, w_2)$ with $w_1 \ll w_2$, such that the GMM is very close ($w_1$-close) to a GMM of $k' \le k$ statistically very
distinct components (each pair of components is at least $w_2$ far apart).

Given our robust univariate algorithm, we embark on a partition pursuit where we try to find $n^2$ vectors
that yield consistent and compatible univariate parameter sets. In particular, we require that each of the
univariate projections yields parameters that satisfy three conditions: 1) they have the same number of
components, 2) the recovered parameters are much more precise than the distances between the projections,
and 3) the distance between the components is large enough so as to ensure an accurate matching of the
components in the different projections.

Finally, given the ability to accurately recover $k' \le k$ high-dimensional Gaussians, where each learned
Gaussian component roughly corresponds to one or more of the Gaussians in the original mixture, we want to
be able to examine each recovered component, and determine whether it corresponds to a single component
of the original mixture, or a set of original components. We first claim that, with high probability, the only
way a subset of original components will end up being grouped into a single recovered component is if the
covariance of the mixture of that subset of components has a very small minimum eigenvalue. The existence
of such an eigenvalue implies that we can accurately cluster the given sample points (whose covariance, recall,
is roughly the identity). Thus, given a recovered set of $k' < k$ parameters, we examine one of these $k'$ components;
if the minimum eigenvalue is sufficiently small, we project the set of data samples onto the corresponding
eigenvector, and then partition the sample points into two clusters (provided the eigenvalue is sufficiently
small, since the overall mixture is in roughly isotropic position, we cluster so as to almost exactly respect
some partition of the original components). Given the set of sample points corresponding (roughly) to the
recovered component that had small eigenvalue, we simply rescale the data so that this subsample is now in
isotropic position, and recursively run the entire algorithm on this rescaled subsample of the data, which, as
we argue, consists of a mixture of $k'' < k$ components of the original mixture, with high probability. We call
this clustering step hierarchical clustering.
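The rescaling step in this outline can be illustrated as follows: an affine map is applied to a subsample so that its empirical mean is zero and its empirical covariance is the identity. The function name `to_isotropic` and the use of a Cholesky factor are choices made for this sketch (any linear map $A$ with $A\,\mathrm{Cov}\,A^T = I$ would serve); the eigenvector projection and the partition into two clusters are not shown.

```python
import math

def to_isotropic(points):
    """Affinely rescale points so their empirical mean is 0 and covariance is the identity."""
    m, n = len(points), len(points[0])
    mean = [sum(p[i] for p in points) / m for i in range(n)]
    centered = [[p[i] - mean[i] for i in range(n)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / m for j in range(n)] for i in range(n)]
    # Cholesky factor L with L L^T = cov (assumes cov is positive definite).
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][t] * L[j][t] for t in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    # Forward substitution solves L y = c, so y = L^{-1} c has covariance L^{-1} cov L^{-T} = I.
    out = []
    for c in centered:
        y = []
        for i in range(n):
            y.append((c[i] - sum(L[i][t] * y[t] for t in range(i))) / L[i][i])
        out.append(y)
    return out

# Hypothetical subsample of points assigned to one recovered component.
points = [[0.0, 0.0], [2.0, 1.0], [4.0, 3.0], [1.0, 5.0], [3.0, 2.0], [6.0, 1.0]]
iso = to_isotropic(points)
```

After this map, the recursive call sees a subsample that again satisfies the isotropy assumption used by the rest of the algorithm.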

We give a detailed summary in Appendix A of the main elements of each of these three main components:
the robust univariate algorithm, partition pursuit, and hierarchical clustering.

2.1 Definitions 

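Definition 3. Given two probability distributions $f(x), g(x)$ on $\mathbb{R}^n$, we define the statistical distance
between these distributions as

$$D(f(x), g(x)) = \frac{1}{2} \| f(x) - g(x) \|_1 = \frac{1}{2} \int_{\mathbb{R}^n} |f(x) - g(x)| \, dx.$$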



We will also be interested in a related notion of the parameter distance between two univariate Gaussians: 

Definition 4. Given two univariate Gaussians, $F_1 = \mathcal{N}(\mu_1, \sigma_1^2)$ and $F_2 = \mathcal{N}(\mu_2, \sigma_2^2)$, we define the parameter
distance as

$$D_p(F_1, F_2) = |\mu_1 - \mu_2| + |\sigma_1^2 - \sigma_2^2|.$$

In general, the parameter distance and the statistical distance between two univariate Gaussians can
be unrelated. There are pairs of univariate Gaussians with arbitrarily small parameter distance, and yet
statistical distance close to 1, and there are pairs of univariate Gaussians with arbitrarily small statistical
distance, and yet arbitrarily large parameter distances. But these scenarios can only occur if the variances
can be arbitrarily small or arbitrarily large. In many instances in this paper, we will have reasonable upper
and lower bounds on the variances, and this will allow us to move back and forth between statistical distance
and parameter distance, but we will highlight when we are doing so and note why we are able to assume an
upper and lower bound on variance in that particular situation.

As we noted, there are $\epsilon$-statistically learnable mixtures of three Gaussians that are in isotropic position,
but for which with overwhelming probability over a random direction $r$, in the projection onto $r$, there will
be some pair of univariate Gaussians that are arbitrarily close in parameter distance. In these cases, our
univariate algorithm may not return an estimate with three components, but will return a mixture which has
only two components but is still a good estimate for the parameters of the projected mixture. To formalize
this notion, we introduce what we call an $\epsilon$-correct subdivision.

Definition 5. Given a GMM of $k$ Gaussians, $F = \sum_i w_i \mathcal{N}(\mu_i, \sigma_i^2)$, and a GMM of $k' \le k$ Gaussians,
$\hat{F} = \sum_i \hat{w}_i \mathcal{N}(\hat{\mu}_i, \hat{\sigma}_i^2)$, we call $\hat{F}$ an $\epsilon$-correct subdivision of $F$ if there is a function $\pi : [k] \to [k']$ that is onto
and satisfies

1. for all $j \in [k']$, $\left| \sum_{i : \pi(i) = j} w_i - \hat{w}_j \right| \le \epsilon$, and

2. for all $i \in [k]$, $|\mu_i - \hat{\mu}_{\pi(i)}| + |\sigma_i^2 - \hat{\sigma}_{\pi(i)}^2| \le \epsilon$.

Notationally, we will write $(\hat{F}, \pi) \in \mathcal{D}_\epsilon(F)$ as shorthand for the statement that $\hat{F}$ is an $\epsilon$-correct subdivision
for $F$ and $\pi$ is the (onto) function from $[k]$ to $[k']$ that groups $F$ into $\hat{F}$ as above.

Note that this definition, unlike the definition of an $\epsilon$-close estimate, uses parameter distance as opposed to
statistical distance. This is critical because our univariate algorithm will only be able to return an estimate
that is an $\epsilon$-correct subdivision when the notion of "close" is in parameter distance, and not statistical distance:
in general there could be a component of the univariate mixture of arbitrarily small variance,
we will only be able to match this to an additive guarantee, and this implies nothing about the statistical
distance between our estimate and the actual component.
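The gap between the two notions of distance can be checked numerically. The sketch below assumes the statistical distance is the usual total variation distance, $\frac{1}{2}\int |f - g|$, and approximates it with a midpoint rule; the function names, the integration range, and the two example pairs are assumptions made for illustration.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def parameter_distance(mu1, var1, mu2, var2):
    """|mu_1 - mu_2| + |sigma_1^2 - sigma_2^2| for two univariate Gaussians."""
    return abs(mu1 - mu2) + abs(var1 - var2)

def statistical_distance(mu1, var1, mu2, var2, steps=20000):
    """Midpoint-rule approximation of (1/2) * integral |f - g| for two univariate Gaussians."""
    spread = 10 * math.sqrt(max(var1, var2))
    lo, hi = min(mu1, mu2) - spread, max(mu1, mu2) + spread
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += abs(gaussian_pdf(x, mu1, var1) - gaussian_pdf(x, mu2, var2))
    return 0.5 * total * h

# Tiny variances: parameter distance 0.01, yet the two Gaussians barely overlap.
narrow = (0.0, 1e-6, 0.01, 1e-6)
# Huge variances: parameter distance 100, yet the two densities are nearly identical.
wide = (0.0, 1e6, 0.0, 1e6 + 100.0)

narrow_dp, narrow_tv = parameter_distance(*narrow), statistical_distance(*narrow)
wide_dp, wide_tv = parameter_distance(*wide), statistical_distance(*wide)
```

In the first pair an additive error of 0.01 in the mean already destroys any statistical-distance guarantee, which is the situation the paragraph above describes.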

3 A Robust Univariate Algorithm 

In this section, we give a learning algorithm for univariate mixtures of Gaussians that will be the building
block for our learning algorithm in $n$ dimensions. Unlike in the case of [17], our univariate algorithm will
not necessarily be given a mixture of Gaussians for which all pairwise parameter distances are reasonably
large. Instead, it could happen that we are given a mixture of (say) three Gaussians so that some pair has
arbitrarily small parameter distance.

In the case in which we are guaranteed that all pairwise parameter distances are reasonably large, we can
iterate the technical ideas in [17] to give an inductive proof that a simple brute-force search algorithm will
return good estimates. We call this algorithm the Basic Univariate Algorithm. From this, we build a
General Univariate Algorithm that will return a good estimate regardless of the parameter distances,
although in order to do so we will need to relax the notion of a good estimate to something weaker: the
algorithm returns an $\epsilon$-correct subdivision.

3.1 Polynomially Robust Identifiability 

In this section, we show that we can efficiently learn the parameters of univariate mixtures of Gaussians,
provided that the components of the mixture have nonnegligible pairwise parameter distances. We refer to
this algorithm as the Basic Univariate Algorithm. Such an algorithm will follow easily from Theorem 4,
the polynomially robust identifiability of univariate mixtures. Throughout this section we will consider two
univariate mixtures of Gaussians:

$$F(x) = \sum_{i=1}^{n} w_i \mathcal{N}(\mu_i, \sigma_i^2, x), \quad \text{and} \quad F'(x) = \sum_{i=1}^{k} w_i' \mathcal{N}(\mu_i', \sigma_i'^2, x).$$

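Definition 6. We will call the pair $F, F'$ $\epsilon$-standard if $\sigma_i^2, \sigma_i'^2 < 1$ and if $\epsilon$ satisfies:

1. $w_i, w_i' \in [\epsilon, 1]$

3. $|\mu_i - \mu_j| + |\sigma_i^2 - \sigma_j^2| > \epsilon$ and $|\mu_i' - \mu_j'| + |\sigma_i'^2 - \sigma_j'^2| > \epsilon$ for all $i \ne j$

4. $\epsilon < \min_\pi \sum_i \left( |w_i - w'_{\pi(i)}| + |\mu_i - \mu'_{\pi(i)}| + |\sigma_i^2 - \sigma'^2_{\pi(i)}| \right)$,

where the minimization is taken over all mappings $\pi : \{1, \dots, n\} \to \{1, \dots, k\}$.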

Theorem 4. There is a constant $c > 0$ such that, for any $\epsilon$-standard $F, F'$ and any $\epsilon < c$,

$$\max_{i \le 2(n+k-1)} |M_i(F) - M_i(F')| > \epsilon^{O(k)},$$

where $M_i(F)$ denotes the $i$-th moment of $F$.

While the dependency on $k$ in Theorem 4 is very bad, as we show in Section 6, this exponential dependency
on $k$ is necessary. Specifically, we give a construction of two $1/k$-standard distributions whose statistical
distance is exponentially small in $k$.

Given the polynomially robust identifiability guaranteed by the above theorem, and simple concentration
bounds on the $i$-th sample moment, it is easy to see that a brute-force search over a set of candidate parameter
sets will yield an efficient algorithm that recovers the parameters of a univariate mixture of Gaussians whose
components have pairwise parameter distance at least $\epsilon$: roughly, the Basic Univariate Algorithm will take a
polynomial number of samples, compute the first $4k - 2$ sample moments, and compare those with the first
$4k - 2$ moments of each of the candidate parameter sets. The algorithm then returns the parameter set whose
moments most closely match the sample moments. Theorem 4 guarantees that if the first $4k - 2$ sample
moments closely match those of the chosen parameter set, then the parameter set must be nearly accurate.
To conclude the proof, we argue that a polynomial-sized set of candidate parameters suffices to guarantee
that at least one set of parameters will yield moments sufficiently close to the sample moments, which, by
simple concentration bounds, will be close to the true moments of the GMM. We state the corollary below,
and defer the details of the algorithm and the proof of its correctness to Appendix C.

Corollary 5. Suppose we are given access to independent samples from a GMM $\sum_{i=1}^{k} w_i \mathcal{N}(\mu_i, \sigma_i^2, x)$ with
mean 0 and variance in the interval $[1/2, 2]$, where $w_i > \epsilon$ and $|\mu_i - \mu_j| + |\sigma_i^2 - \sigma_j^2| > \epsilon$. There exists a Basic
Univariate Algorithm that, for any fixed $k$, has runtime and sample complexity at most $\mathrm{poly}(1/\epsilon, 1/\delta)$ and with probability at
least $1 - \delta$ will output mixture parameters $\hat{w}_i, \hat{\mu}_i, \hat{\sigma}_i^2$, so that there is a permutation $\pi : [k] \to [k]$ and

$|w_i - \hat{w}_{\pi(i)}| < \epsilon$, $|\mu_i - \hat{\mu}_{\pi(i)}| < \epsilon$, $|\sigma_i^2 - \hat{\sigma}_{\pi(i)}^2| < \epsilon$ for each $i = 1, \dots, k$.
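The brute-force moment matching described above can be sketched for $k = 2$, where $4k - 2 = 6$ moments are compared. This is a toy illustration under stated assumptions: the function names, the coarse candidate grid, and the use of the maximum moment gap as the matching criterion are choices made here, and exact moments stand in for the sample moments in the usage lines. The actual algorithm and its candidate set are deferred to Appendix C.

```python
import itertools

def gaussian_moments(mu, var, count):
    """Raw moments E[x^1], ..., E[x^count] of N(mu, var),
    using the recurrence M_j = mu * M_{j-1} + (j - 1) * var * M_{j-2}."""
    m = [1.0, mu]
    for j in range(2, count + 1):
        m.append(mu * m[j - 1] + (j - 1) * var * m[j - 2])
    return m[1:count + 1]

def mixture_moments(weights, mus, variances, count):
    per = [gaussian_moments(mu, v, count) for mu, v in zip(mus, variances)]
    return [sum(w * p[j] for w, p in zip(weights, per)) for j in range(count)]

def sample_moments(xs, count):
    """Empirical raw moments of a list of samples."""
    return [sum(x ** j for x in xs) / len(xs) for j in range(1, count + 1)]

def basic_univariate_search(target, w_grid, mu_grid, var_grid):
    """Return the two-component candidate whose moments are closest to target."""
    count = len(target)
    best, best_gap = None, float("inf")
    for w, mu1, mu2, v1, v2 in itertools.product(
            w_grid, mu_grid, mu_grid, var_grid, var_grid):
        cand = mixture_moments([w, 1 - w], [mu1, mu2], [v1, v2], count)
        gap = max(abs(a - b) for a, b in zip(cand, target))
        if gap < best_gap:
            best, best_gap = ([w, 1 - w], [mu1, mu2], [v1, v2]), gap
    return best, best_gap

# Hypothetical target: exact first six moments of 0.3 N(-1, 0.5) + 0.7 N(1.5, 1).
true_moments = mixture_moments([0.3, 0.7], [-1.0, 1.5], [0.5, 1.0], 6)
w_grid = [i / 10 for i in range(1, 10)]
mu_grid = [i / 2 for i in range(-4, 5)]
var_grid = [0.5, 1.0, 1.5, 2.0]
estimate, gap = basic_univariate_search(true_moments, w_grid, mu_grid, var_grid)
```

With real data one would pass `sample_moments(samples, 6)` as the target. High-order sample moments concentrate slowly, which is one reason the sample requirement grows quickly with $k$.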

3.2 The General Univariate Algorithm 

In this section we seek to extend the Basic Univariate Algorithm of Corollary 5 to the general setting of a
univariate mixture of $k$ Gaussians without any requirement that the components have significant pairwise
parameter distances. In particular, given some target accuracy $\epsilon$, and access to independent samples from a
mixture $F$ of $k$ univariate Gaussians, we want to efficiently compute a mixture $F'$ of $k' \le k$ Gaussians that
is an $\epsilon$-correct subdivision of $F$.

Proposition 6. There is a General Univariate Algorithm which, given $\epsilon, \delta > 0$, and access to a GMM of
$k$ Gaussians, $F = \sum_i w_i \mathcal{N}(\mu_i, \sigma_i^2)$, that is in near isotropic position and satisfies $w_i > \epsilon$, will run in time
polynomial in $1/\epsilon$ and $1/\delta$, and will return with probability at least $1 - \delta$ a GMM of $k' \le k$ Gaussians $\hat{F}$ that
is an $\epsilon$-correct subdivision of $F$.

The critical insight in building up such a General Univariate Algorithm is that if two components are
actually close enough (in statistical distance), then the Basic Univariate Algorithm could never tell these
two components apart from a single (appropriately chosen) Gaussian, because this algorithm only uses a
polynomial number of samples. So given a target precision $\epsilon_1$ for the Basic Univariate Algorithm, there
is some window that describes whether or not the algorithm will work correctly. If all pairwise parameter
distances are either sufficiently large or sufficiently small, then the Basic Univariate Algorithm will
function as if it were given sample access to a mixture that actually does meet the requirements of the
algorithm. So as long as no parameter distance falls inside a particular window (which characterizes whether
or not the algorithm will behave properly), the algorithm will return a correct computation.

However, when there is some parameter distance that falls inside the Basic Univariate Algorithm's 
window, we are not guaranteed that the Basic Univariate Algorithm will fail safely. The idea, then, is
to use many disjoint windows (each of which corresponds to running the Basic Univariate Algorithm 
with some target precision). If we choose enough such windows, each pairwise parameter distance can only 
corrupt a single run of the Basic Univariate Algorithm, so a majority of the computations will be
correct. We will never know which computations resulted from cases when no parameter distance fell inside 
the corresponding window, but we will be able to define a notion of consensus among these different runs 
of the Basic Univariate Algorithm so that a majority of the runs will agree, and any run which agrees 
with some computation that was correct will also be close to correct. 
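
The consensus step described above can be illustrated with a minimal sketch. The agreement test used here (equal number of components, and parameters matching within a tolerance after sorting) is a hypothetical stand-in chosen for illustration; the notion of consensus actually used by the General Univariate Algorithm is the one defined in the appendix. The names `agree`, `consensus` and `tol` are assumptions of this sketch.

```python
from typing import List, Optional, Tuple

Component = Tuple[float, float, float]  # (weight, mean, variance)
Estimate = List[Component]


def agree(a: Estimate, b: Estimate, tol: float) -> bool:
    """Hypothetical agreement test: same number of components and, after
    sorting by (mean, variance), every parameter matches within tol."""
    if len(a) != len(b):
        return False
    key = lambda c: (c[1], c[2])
    for (w1, m1, v1), (w2, m2, v2) in zip(sorted(a, key=key), sorted(b, key=key)):
        if abs(w1 - w2) > tol or abs(m1 - m2) > tol or abs(v1 - v2) > tol:
            return False
    return True


def consensus(runs: List[Estimate], tol: float) -> Optional[Estimate]:
    """Return some run that agrees with a strict majority of all runs, if any."""
    for candidate in runs:
        votes = sum(agree(candidate, other, tol) for other in runs)
        if 2 * votes > len(runs):
            return candidate
    return None
```

If a majority of the runs are correct and therefore mutually consistent, any run returned by `consensus` agrees with at least one correct run, which is the property the argument above relies on.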

We defer the algorithm and proof of correctness to Appendix [P] 

4 Partition Pursuit 
4.1 Outline 

In this section we demonstrate how to use the General Univariate Algorithm to obtain good additive
approximations in $n$ dimensions. Roughly, we will project the $n$-dimensional mixture $F$ onto many close-by
directions, and run the General Univariate Algorithm on each projection. This is also how the
algorithm in [17] is able to recover good additive estimates in $n$ dimensions. However, we will have to cope
with the additional complication that our univariate algorithm (the General Univariate Algorithm)
does not necessarily return an estimate that is a mixture of $k$ Gaussians.

We explain in detail how the algorithm in [17] is able to obtain additive approximation guarantees in $n$
dimensions, building on a univariate algorithm for learning mixtures of two Gaussians. Let $\epsilon_3 \gg \epsilon_2 \gg \epsilon_1$.
Given any $\epsilon$-statistically learnable mixture of two Gaussians in $n$ dimensions, with high probability, for a
direction $r$ chosen uniformly at random the parameter distance between the two Gaussians in $P_r[F]$ will
be at least $\epsilon_3$. Then given such a direction $r$, we can choose different directions $r^{x,y}$, each of which is
$\epsilon_2$-close to $r$ (i.e. $\|r - r^{x,y}\| \le \epsilon_2$). The mean and variance of a component in $P_u[F]$ change continuously
as we vary the direction $u$ from $r$ to $r^{x,y}$, and this implies that for $\epsilon_2 \ll \epsilon_3$, we will be able to consistently
pair up estimates recovered from each projection, so that for each Gaussian we have $n^2$ different estimates
in different directions of the projected mean and variance. Each of these estimates is accurate to within $\epsilon_1$
(i.e. this is the target precision that is given to the univariate algorithm). For any Gaussian, an estimate for
the projected mean and the projected variance for a direction $r$ gives a linear constraint on the mean vector
$\mu$ and the covariance matrix $\Sigma$. As a result, if $\epsilon_1 \ll \epsilon_2$ then the precision is much finer than the condition
number of this system of linear constraints on $\mu, \Sigma$ and this yields an accurate estimate in $n$ dimensions.
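
To make the linear-constraint view concrete, the following sketch recovers $\mu$ and a symmetric $\Sigma$ from projected means $m_j \approx r_j \cdot \mu$ and projected variances $v_j \approx r_j^T \Sigma r_j$ by ordinary least squares. It is an illustration under assumptions, not the SOLVE procedure of [17]: the function name `recover_mean_and_covariance` is hypothetical, arbitrary directions are accepted rather than the specific directions $r^{x,y}$, and no positive semidefiniteness is enforced.

```python
import numpy as np


def recover_mean_and_covariance(directions, proj_means, proj_vars):
    """Least-squares solve for mu (n,) and symmetric Sigma (n, n) from
    estimates m_j ~ r_j . mu and v_j ~ r_j^T Sigma r_j."""
    R = np.asarray(directions, dtype=float)  # shape (num_directions, n)
    n = R.shape[1]
    mu, *_ = np.linalg.lstsq(R, np.asarray(proj_means, dtype=float), rcond=None)

    # Unknowns of Sigma: entries (a, b) with a <= b. The quadratic form
    # r^T Sigma r is linear in these entries (off-diagonal terms count twice).
    pairs = [(a, b) for a in range(n) for b in range(a, n)]
    A = np.array([[(1.0 if a == b else 2.0) * r[a] * r[b] for a, b in pairs] for r in R])
    s, *_ = np.linalg.lstsq(A, np.asarray(proj_vars, dtype=float), rcond=None)

    sigma = np.zeros((n, n))
    for (a, b), value in zip(pairs, s):
        sigma[a, b] = sigma[b, a] = value
    return mu, sigma
```

The conditioning of the matrix `A` is what ties the achievable accuracy to the spacing of the directions: when all directions are within $\epsilon_2$ of a single direction $r$, errors of size $\epsilon_1$ in the projected estimates are amplified, which is why $\epsilon_1 \ll \epsilon_2$ is required.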

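Lemma 7 ([17]). Let $\epsilon_2, \epsilon_1 > 0$. Suppose $|m^0 - \mu \cdot r|$, $|m^{x,y} - \mu \cdot r^{x,y}|$, $|v^0 - r^T \Sigma r|$, $|v^{x,y} - (r^{x,y})^T \Sigma r^{x,y}|$ are all at
most $\epsilon_1$. Then SOLVE outputs $\hat{\mu} \in \mathbb{R}^n$ and $\hat{\Sigma} \in \mathbb{R}^{n \times n}$ such that $\|\hat{\mu} - \mu\| < \ldots$ and $\|\hat{\Sigma} - \Sigma\| < \ldots$

Furthermore, $\hat{\Sigma} \succeq 0$ and $\hat{\Sigma}$ is symmetric.

The algorithm to which this lemma refers is given in Appendix F.2.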

However, the General Univariate Algorithm does not always return a mixture of $k$ Gaussians, and
can in fact return a mixture of $k' \le k$ Gaussians provided that this mixture is still an $\epsilon_1$-correct subdivision
of $P_u[F]$ (for some direction $u$). But then what happens if we consider two close-by directions, $u$ and $v$, and
the number of Gaussians in the estimate $\hat{F}^u$ is different from the number of Gaussians in the estimate $\hat{F}^v$?

The key insight is that if we choose some direction $r$, and close-by directions $r^{x,y}$, and any estimate returned
for $r^{x,y}$ has more components than the estimate returned for the direction $r$, then we have made progress
because we have identified another Gaussian in the original mixture $F$. So here, rather than trying to use
this estimate for $r^{x,y}$, we just start the algorithm over using $r^{x,y}$ as the original direction, and considering
close-by directions.

The additional complication is that we must make sure every time we see a different number of components,
that we've made progress. We can do so by maintaining a Window from $\epsilon_1$ to $\epsilon_3$, and we say that a Window
is satisfied if the estimate $\hat{F}^r$ returned for some direction $r$ has all pairs of Gaussians either at parameter
distance at least $\epsilon_3$, or at most the precision $\epsilon_1$ of the General Univariate Algorithm. Then if we
consider close-by directions $r^{x,y}$ (that are $\epsilon_2$-close to $r$, for $\epsilon_1 \ll \epsilon_2 \ll \epsilon_3$), we can ensure that whenever we
see a different number of components in the estimate corresponding to some direction $r^{x,y}$, there are more
components. When we see more components, we may need to shift the Window to a Window $W'$ so that
in this new direction $r^{x,y}$, the Window $W'$ is satisfied. We take $r^{x,y}$ as the new base direction. But we have
made progress because we have identified a new component in the mixture.
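
The Window test and the restart rule can be sketched as follows. This is an illustration only: the function names are hypothetical, and the univariate parameter distance is assumed here to be $|\mu_a - \mu_b| + |\sigma_a^2 - \sigma_b^2|$, ignoring weights. The full Partition Pursuit algorithm, including how the Window is shifted, is in the appendix.

```python
from itertools import combinations
from typing import List, Optional, Tuple

Component = Tuple[float, float, float]  # (weight, mean, variance)


def parameter_distance(a: Component, b: Component) -> float:
    """Assumed distance |mu_a - mu_b| + |var_a - var_b|; weights are ignored."""
    return abs(a[1] - b[1]) + abs(a[2] - b[2])


def window_satisfied(estimate: List[Component], eps1: float, eps3: float) -> bool:
    """True if no pair of components has parameter distance strictly between eps1 and eps3."""
    return all(
        not (eps1 < parameter_distance(a, b) < eps3)
        for a, b in combinations(estimate, 2)
    )


def next_base_direction(base_estimate: List[Component],
                        nearby_estimates: List[List[Component]]) -> Optional[int]:
    """Index of a nearby direction whose estimate has more components than the
    base estimate, or None if the base direction can be kept."""
    for index, estimate in enumerate(nearby_estimates):
        if len(estimate) > len(base_estimate):
            return index
    return None
```

Since the number of components can only increase at a restart and is bounded by $k$, a loop built on `next_base_direction` restarts at most $k - 1$ times.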

We state our main theorem in this section, and defer the algorithm and proof to Appendix F.

Theorem 8. Given an $\epsilon$-statistically learnable GMM $F$ in isotropic position, the PARTITION PURSUIT
Algorithm will recover an $\epsilon$-correct subdivision $\hat{F}$, and if $F$ has more than one component, $\hat{F}$ also has more
than one component.

5 Clustering and Recursion 
5.1 Outline 

In this section, we give an efficient algorithm for learning an estimate $\hat{F}$ that is $\epsilon$-close to the actual mixture
$F$. Partition Pursuit assumes that the mixture $F$ is in isotropic position, and even though $F$ is not
necessarily in isotropic position, we will be able to get around this hurdle by first taking enough samples
to compute a transformation that places the mixture $F$ in nearly isotropic position and then applying this
transformation to each sample from the oracle. The main technical challenge in this section is actually what
to do when the mixture $\hat{F}$ returned by Partition Pursuit is a good additive approximation to $F$ (i.e. it is
an $\epsilon_1$-correct subdivision with $\epsilon_1 \ll \epsilon$), but is not $\epsilon$-close to the mixture $F$. This can only happen if there
is a component in $\hat{F}$ that has a very small variance in some direction. Consider for example, two Gaussians
in one dimension, $\mathcal{N}(0, \gamma)$ and $\mathcal{N}(0, \gamma + \epsilon_1)$. Even if $\epsilon_1$ is very small, if $\gamma$ is much smaller, then the statistical
distance between these two Gaussians can be arbitrarily close to 1.
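
The example of $\mathcal{N}(0, \gamma)$ versus $\mathcal{N}(0, \gamma + \epsilon_1)$ can be checked numerically. The sketch below uses the closed form for the total variation distance between two zero-mean Gaussians (the densities cross at $|x| = c$ with $c^2 = \ln(\sigma_2^2/\sigma_1^2)\,\sigma_1^2\sigma_2^2/(\sigma_2^2 - \sigma_1^2)$); the function names and the particular values of $\gamma$ are chosen for illustration.

```python
import math


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tv_zero_mean_gaussians(var_small: float, var_large: float) -> float:
    """Total variation distance between N(0, var_small) and N(0, var_large),
    assuming var_small <= var_large."""
    if var_small == var_large:
        return 0.0
    # The two densities cross at |x| = c; the narrower one is larger inside (-c, c).
    c = math.sqrt(
        math.log(var_large / var_small) * var_small * var_large / (var_large - var_small)
    )
    return 2.0 * (normal_cdf(c / math.sqrt(var_small)) - normal_cdf(c / math.sqrt(var_large)))


eps1 = 1e-3
for gamma in (1.0, 1e-3, 1e-9):
    print(gamma, tv_zero_mean_gaussians(gamma, gamma + eps1))
```

With the additive gap $\epsilon_1$ held fixed, the distance grows from nearly 0 toward 1 as $\gamma$ shrinks, which is exactly the failure mode that additive parameter accuracy alone cannot rule out.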

So the high-level idea is that if the estimate $\hat{F}$ returned by Partition Pursuit is not $\epsilon$-close to $F$ (but
$\hat{F}$ is an $\epsilon_1$-correct subdivision of $F$ for $\epsilon_1 \ll \epsilon$), then it must be the case that some component $\hat{F}_i$ of $\hat{F}$ has
a covariance matrix $\hat{\Sigma}_i$ so that for some direction $v$, $v^T \hat{\Sigma}_i v$ is very small. Then we can use this direction
$v$ to still make progress: If we project the mixture $F$ onto $v$, we will be able to cluster accurately. There
will be some partition of the Gaussians in $F$ into two disjoint, non-empty sets of components $S, T$ and some
clustering scheme that can accurately cluster points sampled from $F$ into points that originated from a
component in $S$ and points that originated from a component in $T$. So we can hope to accurately cluster
enough points sampled from $F$ into sets of points that originated from $S$ and sets of points that originated
from $T$, and then we can run our learning algorithm (with a smaller maximum of at most $k - 1$ components)
on each set of points. By induction, this learning algorithm will return close estimates, and if we take a convex
combination of these estimates we obtain a new estimate $\hat{F}'$ that is $\epsilon$-close to $F$. The main technical challenge
is in showing that if there is some component of $\hat{F}$ with a small enough variance in some direction $v$, then we
can accurately cluster points sampled from $F$. Given this, our main result follows almost immediately from
an inductive argument.

5.2 How to Cluster 

Here we formalize the notion of a clustering scheme. Additionally, we state the key lemmas that will be
useful in showing that if $\hat{F}$ is not an $\epsilon$-close estimate to $F$, then we can use $\hat{F}$ to construct a good clustering
scheme that makes progress on our learning problem.

Definition 7. We will call $A, B \subset \mathbb{R}^n$ a clustering scheme if $A \cap B = \emptyset$.

Definition 8. For $A \subset \mathbb{R}^n$, we will write $P[F_i, A]$ to denote $\Pr_{x \leftarrow F_i}[x \in A]$, i.e. the probability that a
randomly chosen sample from $F_i$ is in the set $A$.

If we have a direction $v$ and some component $F_i$ which has small variance in direction $v$, we want to
use this direction to cluster accurately. The intuition is clearest in the case of mixtures of two Gaussians:
Suppose one of the components, say $F_1$, had small variance in direction $v$. If the entire mixture is in isotropic
position, then the variance of the mixture when projected onto direction $v$ is 1. This can only happen if either
the difference in projected means $|v^T(\mu_1 - \mu_2)|$ is large or the variance of $F_2$ in direction $v$ is large. In the
first case, we can choose an interval around each projected (estimate) mean $v^T \hat{\mu}_1$ and $v^T \hat{\mu}_2$ so that with
high probability, any point sampled from $F_1$ is contained in the interval around $v^T \hat{\mu}_1$ and similarly for $F_2$.
If, instead, the variance of $F_2$ when projected onto $v$ is large, then again a small interval around the point
$v^T \hat{\mu}_1$ will contain most samples from $F_1$, but because the maximum density of $v^T F_2$ is never large and the
interval around $v^T \hat{\mu}_1$ is not too large either, most samples from $F_2$ will not be contained in the interval. This
idea is the basis of our clustering lemmas; although there will be additional complications when the mixture
contains more than two Gaussians, the intuition is close to the same.
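
The dichotomy in the two-Gaussian case follows from the standard variance decomposition of a mixture. Writing $w_1 + w_2 = 1$ for the mixing weights and $\sigma_i^2 = v^T \Sigma_i v$ for the projected variances (notation introduced here only for this illustration), isotropic position gives:

```latex
\begin{align*}
1 \;=\; \mathrm{Var}_{x \leftarrow F}\big(v^T x\big)
  \;=\; w_1 \sigma_1^2 + w_2 \sigma_2^2 + w_1 w_2 \big(v^T(\mu_1 - \mu_2)\big)^2 ,
\qquad\text{so}\qquad
w_2 \sigma_2^2 + w_1 w_2 \big(v^T(\mu_1 - \mu_2)\big)^2 \;=\; 1 - w_1 \sigma_1^2 \;\ge\; 1 - \sigma_1^2 .
\end{align*}
```

Hence if $\sigma_1^2$ is small, at least one of the two remaining terms is at least $(1 - \sigma_1^2)/2$: either the projected means are well separated or the projected variance of $F_2$ is large, which are the two cases treated above.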

Let $(\hat{F}, \pi) \in D_{\epsilon_1}(F)$. Suppose also that $\hat{F}$ is a mixture of $k'$ components.

Lemma 9. Suppose that for some direction $v$, for all $i$: $v^T \hat{\Sigma}_i v \le \epsilon_2$, for $\epsilon_1 \le \ldots$. If there is some bi-partition
$S \subset [k']$ s.t. $\min_{i \in S, j \in [k'] - S} |v^T(\hat{\mu}_i - \hat{\mu}_j)| \ge \ldots$, then there is a clustering scheme $(A, B)$ (based only
on $\hat{F}$) so that for all $i \in S, j \in \pi^{-1}(i)$, $P[F_j, A] \ge 1 - \epsilon_3$ and for all $i \notin S, j \in \pi^{-1}(i)$, $P[F_j, B] \ge 1 - \epsilon_3$.

This lemma corresponds to the first case in the above thought exercise when there is some bi-partition of 
the components so that all pairs of projected means across the bi-partition are reasonably separated. 

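Lemma 10. Suppose that for some direction $v$ and some $i \in [k']$: $v^T \hat{\Sigma}_i v \le \epsilon_m$, for $\epsilon_m \gg \epsilon_1$. If
there is some bi-partition $S \subset [k']$ s.t.

$\max(\max_{j \in S} v^T \hat{\Sigma}_j v, \ldots) \,/\, \min_{j \notin S} v^T \hat{\Sigma}_j v \le \epsilon_t$

(and $\epsilon_t \ll \ldots$) then there is a clustering scheme $A, B$ such that for all $i \in S, j \in \pi^{-1}(i)$, $P[F_j, A] \ge 1 - \epsilon_3$
and for all $i \notin S, j \in \pi^{-1}(i)$, $P[F_j, B] \ge 1 - \epsilon_3$.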

This lemma corresponds to the second case, when there is some bi-partition of the
components so that one side of the bi-partition has projected variances that are much larger than the other.
The proofs of these lemmas, along with additional technical details, are given in Appendix G.2.

5.3 Making Progress when there is a Small Variance

We state a lemma from [17] which formalizes the intuition that if there is no component in $\hat{F}$ with small
variance in any direction, then $\hat{F}$ is a good statistical estimate to $F$:

Lemma 11. Suppose $\|\hat{\mu}_i - \mu_i\| \le \epsilon_1$, $\|\hat{\Sigma}_i - \Sigma_i\|_F \le \epsilon_1$, and $|\hat{w}_i - w_i| \le \epsilon_1$. If either $\|\Sigma_i^{-1}\|_2 \le \ldots$ or
$\|\hat{\Sigma}_i^{-1}\|_2 \le \ldots$, then $\ldots$

We will use this lemma as a building block to prove: 

Theorem 12. The Hierarchical Clustering Algorithm either returns an $\epsilon$-close statistical estimate
$\hat{F}$ for $F$, or returns a clustering scheme $A, B$ such that there is some bipartition $S \subset [k]$ such that for all
$i \in S, j \in \pi^{-1}(i)$, $P[F_j, A] \ge 1 - \epsilon_3$ and for all $i \notin S, j \in \pi^{-1}(i)$, $P[F_j, B] \ge 1 - \epsilon_3$. Moreover, $S$ and $[k] - S$ are
both non-empty.

We defer the algorithm and the proof of correctness to Appendix G.

5.4 Recursion

Lemma 13. [Isotropic Projection Lemma] Given a mixture of $k$ $n$-dimensional Gaussians $F = \sum_i w_i F_i$
that is in isotropic position and is $\epsilon$-statistically learnable, with probability $\ge 1 - \delta$ over a randomly chosen
direction $u$, there is some pair of Gaussians $F_i, F_j$ s.t. $D_p(P_u[F_i], P_u[F_j]) \ge \ldots$

We defer a proof of this lemma to Appendix H.

Definition 9. Let $H_a(\epsilon, \delta, k)$, $H_i(\epsilon, \delta, k)$ be the inverse of the number of samples needed by the High
Dimensional Anisotropic Algorithm and the High Dimensional Isotropic Algorithm respectively
when given target precision $\epsilon$ (and access to an $\epsilon$-statistically learnable distribution), an upper bound $k$ on the
number of Gaussians, and an error parameter $\delta$.

We defer the algorithms to Appendix G.4.

Theorem 14. Given $k$, $\epsilon$, and a mixture of at most $k$ Gaussians $F$ that is $\epsilon$-statistically learnable, the High
Dimensional Anisotropic Algorithm returns an estimate $\hat{F}$ that is $\epsilon$-close to the actual mixture $F$.

Proof. We prove this theorem by induction. Let $\epsilon_{k-1} = H_a(\frac{\epsilon}{2}, \delta, k - 1)$.

We assume by induction that both the High Dimensional Isotropic Algorithm and the High
Dimensional Anisotropic Algorithm return an $\epsilon$-close estimate for all values of $k' \le k - 1$. We then
consider both algorithms for the case of $k$:

Consider the High Dimensional Isotropic Algorithm which is given $k$, $\epsilon$, and a mixture of at most
$k$ Gaussians $F$ that is $\epsilon$-statistically learnable and is in isotropic position: We first run the Hierarchical
Clustering Algorithm with parameters $\epsilon, \delta, \epsilon_3, k$ where $\epsilon_3 = \ldots\,\epsilon\,\epsilon_{k-1}\,\delta$. If this algorithm returns an estimate
$\hat{F}$, we can return this estimate and it is guaranteed to be $\epsilon$-close to the actual mixture.

Note that if the number of components in $F$ is 1, then the Hierarchical Clustering Algorithm will
necessarily return an estimate $\hat{F}$, because there is no partitioning scheme that partitions $F$ into two subsets
of components that are both non-empty. This establishes the base case in the inductive argument.

Otherwise the output of the Hierarchical Clustering Algorithm is a clustering scheme $(A, B)$
with the property that there is some partition $S, T$ of the Gaussians in $F$ ($S, T \ne \emptyset$) and for all $i \in S$,
$\Pr_{x \leftarrow F_i}[x \in A] \ge 1 - \epsilon_3$, and for all $j \in T$, $\Pr_{x \leftarrow F_j}[x \in B] \ge 1 - \epsilon_3$. Let $F_S, F_T$ be the (re-weighted) mixtures that
result from placing every component in $S$ from $F$ into $F_S$, and every component in $T$ from $F$ into $F_T$. Note
that $F_S, F_T$ are still $\epsilon$-statistically learnable, but may not be in isotropic position any longer.

So we can take $m = \ldots$ total samples $x_1, x_2, \ldots, x_m$ from $SA(F)$. With probability at least $1 - \delta$:

1. All samples $x_1, x_2, \ldots, x_m$ are either in $A$ or $B$.

2. The number of samples in $A$ and the number of samples in $B$ will each be at least $\ldots$

3. All samples are clustered correctly, i.e. if $x_i \in A$, then $x_i$ was generated by some Gaussian $F_j$ with
$j \in S$ and if $x_i \in B$, then $x_i$ was generated by some Gaussian $F_j$ with $j \in T$.

Let $X_S, X_T$ be the samples from $x_1, x_2, \ldots, x_m$ that are in $A, B$ respectively. We can then run the
High Dimensional Anisotropic Algorithm with parameters $k - 1$ on each set $X_S$ and $X_T$. Let
the algorithm return the mixtures $\hat{F}_A, \hat{F}_B$ respectively. We return a convex combination of these mixtures,
$\hat{F} = \frac{|X_S|}{m} \hat{F}_A + \frac{|X_T|}{m} \hat{F}_B$. The estimates $\hat{F}_A, \hat{F}_B$ are $\epsilon$-close estimates to $F_S, F_T$ respectively. We can write
$F = w_A F_S + w_B F_T$, and with high probability $\frac{|X_S|}{m}, \frac{|X_T|}{m}$ will be close to $w_A, w_B$ respectively. Then
this implies that $\hat{F}$ is $\epsilon$-close to $F$. Thus by induction, the output of the High Dimensional Isotropic
Algorithm is an estimate $\hat{F}$ that is $\epsilon$-close to $F$.
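
The split-learn-recombine step of the induction can be sketched as follows. This is a schematic under assumptions, not the High Dimensional Isotropic Algorithm itself: `in_a` and `in_b` stand for membership tests for the sets $A$ and $B$ of the clustering scheme, `learn_sub_mixture` stands for the recursive call with at most $k - 1$ components, and all three names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Component = Tuple[float, object]  # (weight, parameters of one Gaussian)
Mixture = List[Component]


def learn_by_clustering(
    samples: Sequence,
    in_a: Callable[[object], bool],
    in_b: Callable[[object], bool],
    learn_sub_mixture: Callable[[Sequence], Mixture],
) -> Mixture:
    """Split samples with the clustering scheme (A, B), learn each side, and
    recombine with weights given by the empirical cluster fractions."""
    side_a = [x for x in samples if in_a(x)]
    side_b = [x for x in samples if in_b(x)]
    if len(side_a) + len(side_b) != len(samples) or not side_a or not side_b:
        raise ValueError("clustering scheme did not split the samples cleanly")
    total = len(samples)
    combined: Mixture = []
    for side in (side_a, side_b):
        fraction = len(side) / total  # empirical estimate of the mass of this side
        for weight, params in learn_sub_mixture(side):
            combined.append((fraction * weight, params))
    return combined
```

The weights of the returned mixture sum to one whenever each recursive call returns weights summing to one, so the result is a convex combination of the two sub-estimates.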

We need to also verify by induction that the output of the High Dimensional Anisotropic Algorithm 
is also an e-close estimate to F. So suppose that the input to the High Dimensional Anisotropic 
Algorithm is a mixture of at most k Gaussians, that is e-statistically learnable and is not necessarily in 
isotropic position. 

We let $\epsilon_k = \delta H_a(\frac{\epsilon}{2}, \delta, k)$. We take $m = O(\ldots)$ samples $x_1, x_2, \ldots, x_m$, compute the transformation
$T$ that places these samples in exactly isotropic position, and run the High Dimensional Isotropic
Algorithm with the sample oracle $T(SA(F))$ and parameters $\frac{\epsilon}{2}, \delta, k$. Using the above section and the induction
hypothesis, the High Dimensional Isotropic Algorithm outputs an $\epsilon$-close estimate for all values of $k' \le k$.
The input sample oracle $T(SA(F))$ is not exactly in isotropic position, but (using Theorem 56) there is another
mixture $F'$ which is in exactly isotropic position, that is $\frac{\epsilon}{2}$-close to $F$ and for which $D(F, F') \le \epsilon_k$. Since
the High Dimensional Isotropic Algorithm will only take $H_a(\frac{\epsilon}{2}, \delta, k - 1)$ samples, with probability at
least $1 - \delta$ we can assume that all these samples come from $F'$, which implies (by induction) that the output
will be an estimate $\hat{F}$ that is $\frac{\epsilon}{2}$-close to $F'$, which means that $\hat{F}$ is also $\epsilon$-close to $F$, as desired. □



6 Exponential Dependence on k is Inevitable 

In this section, we present a lower bound, showing that the inverse exponential dependency on the number
of Gaussian components in each mixture is necessary, even for mixtures in just one dimension. We show
this by giving a simple construction of two 1-dimensional distributions, $D_1, D_2$, that are $1/(2m)$-standard.
Specifically, they are mixtures of at most $m$ Gaussians, such that the weights of all components of each
mixture are at least $1/(2m)$, and the parameter distance between the pair of distributions is at least $1/(2m)$,
but $\|D_1 - D_2\|_1 \le e^{-m/30}$, for sufficiently large $m$. The construction hinges on the inverse exponential (in
$k \approx \sqrt{m}$) statistical distance between $\mathcal{N}(0, 2)$ and the mixtures of infinitely many Gaussians of unit variance
whose components are centered at multiples of $1/k$, with the weight assigned to the component centered at
$i/k$ being given by $\mathcal{N}(0, 1, i/k)$. Verifying that this is true is a straightforward exercise in Fourier analysis.
The final construction truncates the mixture of infinitely many Gaussians by removing all the components
with centers a distance greater than $k$ from 0. This truncation clearly has negligibly small effect on the
distribution. Finally, we alter the pair of distributions by adding to both distributions Gaussian components
of equal weight with centers at $-k^2/k, (-k^2 + 1)/k, \ldots, k$, which ensures that in the final pair of distributions,
all components have significant weight.
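
The core of the construction can be checked numerically. The sketch below compares $\mathcal{N}(0, 2)$ with the mixture of unit-variance Gaussians centered at multiples of $1/k$, with weights proportional to $\mathcal{N}(0, 1, i/k)$, and estimates the $L_1$ distance by a Riemann sum. The function names, the integration grid, and the truncation radius of 10 are choices made for this illustration; the sketch does not include the extra equal-weight components added in the final construction.

```python
import numpy as np


def normal_pdf(mean, var, x):
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def l1_distance_to_n02(k: int, radius: float = 10.0) -> float:
    """L1 distance between N(0, 2) and the truncated mixture of unit-variance
    Gaussians centred at i/k with weights proportional to N(0, 1, i/k)."""
    x = np.linspace(-12.0, 12.0, 24001)
    dx = x[1] - x[0]
    centers = np.arange(-int(radius * k), int(radius * k) + 1) / k
    weights = normal_pdf(0.0, 1.0, centers)
    weights /= weights.sum()  # normalise so the mixture is a probability density
    mixture = (weights[:, None] * normal_pdf(centers[:, None], 1.0, x[None, :])).sum(axis=0)
    return float(np.sum(np.abs(mixture - normal_pdf(0.0, 2.0, x))) * dx)


for k in (1, 2):
    print(k, l1_distance_to_n02(k))
```

Already for $k = 2$ the computed distance is at the level of floating-point round-off, consistent with a statistical distance that decays exponentially in $k^2$.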

Proposition 15. There exists a pair $D_1, D_2$ of $1/(4k^2 + 2)$-standard distributions that are each mixtures of
$k^2 + 1$ Gaussians such that

$\|D_1 - D_2\|_1 \le 11 k e^{-k^2/24}$.

We can define $F_{k,N}(x) = c_k \sum_{i=-N}^{N} \ldots\, \mathcal{N}(i/k, 1/2, x)$, and we give a plot of $F_{k,N}$ for $k = 2$, $N = 2$
in Figure 1a, together with the corresponding plot of each component, and in Figure 1b we give a plot of each component
of $F_{k,N}$ for $k = 4$, $N = 8$.

We defer the details to Appendix E.



7 Conclusions 

We give an estimator that converges to the true distribution at an inverse polynomial rate, and this result 
has implications for polynomial-time clustering and density estimation. A natural question is: "What is the 
optimal rate of convergence?" This question is wide open, and all we can say for certain is that the rate of 
convergence is at worst polynomial in the dimension and the inverse of the desired accuracy, and exponential 
in the number of components. We made no attempt here to optimize the constants in the exponent of 
the rate of convergence and even if we had, the theoretical runtime would still be extremely impractical. 
This, however, raises the practically relevant question of whether aspects of our algorithm can be combined 
with existing heuristics that seem to perform well in most applications. For example, the brute-force-search
component of our univariate algorithm is clearly expensive; perhaps employing existing heuristics (such as the 
EM algorithm) for the univariate problems, in conjunction with aspects of our dimension-reduction machinery,
might yield improved efficiency on real-world instances.

Figure 1: (a) for $k = 2$, $N = 2$ (solid) and $\mathcal{N}(0, 1)$ (dashed); (b) for $k = 4$, $N = 8$.



Additionally, we note that much of the machinery we developed, from the "deconvolution" argument for
the polynomially robust identifiability to the partition pursuit and hierarchical clustering for the dimension
reduction arguments, seems to be relatively general and robust. We suspect that such tools could be applied
to yield corresponding results for other families of distributions.



A In-Depth Outline



A.1 A Robust Univariate Algorithm

To start, suppose that we are given access to independent samples from a mixture of $k$ Gaussians, and given a
unit vector $r$ with the following promise: for each pair $G_1, G_2$ of components, in the projection of the mixture
onto $r$, either the projections of $G_1$ and $G_2$ have reasonably different parameters ($\ge \epsilon_2$), or their projections
are so close that our algorithm could never tell them apart from a single Gaussian (parameter distance at
most $\epsilon_0 \ll \epsilon_1$, where $\epsilon_1 \ll \epsilon_2$ is the desired accuracy of the 1-d parameter learning algorithm). In this case,
our 1-d parameter recovery algorithm will perform correctly, and return some $\epsilon_1$-accurate parameters for a
mixture of $k' \le k$ components.

Thus in general, for a given desired accuracy $\epsilon_1$, there is some critical window, namely $[\epsilon_0, \epsilon_2]$, associated
with the 1-d learning algorithm that determines if it will function correctly. In a given projection, as long as
no pair of components has a parameter distance that falls within this window, then any pair of Gaussians is
either reasonably different in parameters, or so close in parameters that the algorithm will never be able to
tell the difference.

In this way, if an algorithm designer is told the parameters of a given mixture of k Gaussians, he could 
construct an algorithm that would have been able to find some of these parameters. The algorithm would 
project onto a random direction r, and based on the pairwise parameter distances between the univariate 
Gaussians, there will be some window (i.e. some choice of a target precision with which to run the algorithm), 
bounded below by some polynomial in the desired output accuracy, so that the algorithm would function 
correctly. The problem is that while there is always some window that would work for any mixture of k 
univariate Gaussians, we don't know what window to use, and in general if we run the algorithm on a bad 
window, we aren't guaranteed that the algorithm will fail in a safe way. 

To get around this, we run the 1-d Learning Algorithm many times on different windows that
do not intersect. Because there are only $k$ univariate Gaussians, and thus at most $k^2$ different distances
between component parameters in any given projection, at most $k^2$ of these windows can be corrupted. If
we choose sufficiently many windows (but still a polynomial number), a majority of the windows will yield
correct parameters. Even though we can never determine which windows were good and which were bad,
we can return the parameters generated by some window in consensus with a majority, and in this way,
regardless of whether the window was good or bad, it is in consensus with a good window and must also be
close to the correct parameters.
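
The counting argument can be illustrated with a small sketch. The geometric spacing of the windows and the names `disjoint_windows` and `corrupted` are assumptions made for this example; in the algorithm, consecutive windows are related polynomially rather than by a fixed ratio.

```python
from typing import List, Sequence, Tuple

Window = Tuple[float, float]  # (lower, upper) bounds on a parameter distance


def disjoint_windows(upper: float, ratio: float, count: int) -> List[Window]:
    """Shrinking disjoint windows (upper*ratio^(j+1), upper*ratio^j], j = 0..count-1."""
    return [(upper * ratio ** (j + 1), upper * ratio ** j) for j in range(count)]


def corrupted(windows: Sequence[Window], distances: Sequence[float]) -> List[int]:
    """Indices of windows that contain at least one pairwise parameter distance."""
    return [
        j for j, (lo, hi) in enumerate(windows)
        if any(lo < d <= hi for d in distances)
    ]
```

Because the windows are disjoint, each distance lies in at most one of them, so with more than $2k^2$ windows the uncorrupted ones form a strict majority.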

It is important to stress that even after the above consensus is conducted on a given projection, we still
cannot be guaranteed that our univariate algorithm returns a mixture of $k$ Gaussians. Instead, it will return
some mixture of $k' \le k$ Gaussians, where an element in the mixture might correspond to (say) a pair of
Gaussians in the original mixture that were too close to differentiate in the given projection.



A.2 Partition Pursuit



This brings us to the second obstacle outlined in Section 1.4: in order to recover the $n$-dimensional parameters,
we will need estimates of the parameters of the Gaussians when projected on many different directions. But,
as mentioned above, the univariate algorithm will not necessarily return a mixture of $k$ Gaussians, and even
if we choose a direction $r'$ that is sufficiently close to $r$ (but still $\|r - r'\| \gg \epsilon_1$, the accuracy of the 1-d
algorithm), it may be the case that the univariate algorithm for direction $r$ returned $k'$ Gaussians, and the
univariate algorithm for direction $r'$ returned $k'' \ne k'$ Gaussians. How do we pair up these estimates in a
consistent manner?

The key insight is that we are actually making progress if we see more Gaussians when projecting onto a 
different direction. If we choose a new direction, and we see a mixture of Gaussians with more components, 
we should backtrack and start over as if this was the direction we originally chose. We may have to slide the 
window corresponding to our 1-d algorithm and learn at a finer precision than what we chose previously, but 
this finer precision will still be polynomially bounded. Effectively, we are clustering the Gaussian components 
into clusters with the property that the components of each cluster are indistinguishable in each of the
one-dimensional projections that we have considered. In order to make this idea work properly, we also need to
ensure that we maintain a minimum parameter distance between all Gaussian clusters that we have seen (i.e.
this distance is much larger than our 1-d accuracy $\epsilon_1$), so that when we choose a new direction $r'$ sufficiently
close to $r$, a Gaussian component cannot switch clusters. Thus at each stage, each cluster of Gaussians either
continues to be a cluster, or it gets partitioned into several clusters of Gaussians.



A.3 Hierarchical Clustering



The final obstacle outlined in Section 1.4 can be addressed easily via an accurate clustering of the input
samples together with a $k$-Gaussian analog of the Random Projection Lemma. Intuitively, the only way that
a set of high-dimensional Gaussians with significant statistical distance, when projected onto a random vector,
will appear nearly identical is if the re-weighted mixture of the Gaussians in this set is very far from isotropic
position. This motivates the hope that if we have recovered some mixture of $k' < k$ components, then it
must be the case that whichever of these components contains multiple original Gaussians has a covariance
matrix very far from isotropic. Thus such a component must have at least one very small eigenvalue. Given
the eigenvector corresponding to such an eigenvalue, we should be able to very accurately cluster the sample
points into some partition of the original Gaussians. This motivates the following slightly more specific
version of the high-level algorithmic approach:

Given that we have recovered parameters for a mixture of $k' < k$ components:

1. Learn the parameters of some mixture of $k' \le k$ Gaussians, where each learned Gaussian component
corresponds to one or more of the Gaussians in the original mixture.

2. If $k' < k$, for each of the $k'$ components recovered in the previous step:

• If the $i$-th component has covariance matrix "not too far" from isotropic, then conclude that it
corresponds to a single Gaussian in the original mixture.

• Else: 

(a) there is a very small eigenvalue of the covariance matrix, so project the sample points onto 
the corresponding eigenvector, and accurately cluster the sample points that come from this 
component 

(b) Given the sample points corresponding to one of the components, rescale these data points so 
this component (which was very far from isotropic), is now in isotropic position, and repeat 
the entire algorithm on this sub-mixture 
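
Step (a) above, finding the direction of very small variance and projecting onto it, can be sketched with a symmetric eigendecomposition. The function names are hypothetical, and the subsequent one-dimensional clustering rule (which the clustering lemmas supply) is not shown.

```python
import numpy as np


def smallest_variance_direction(covariance):
    """Return (eigenvalue, unit eigenvector) for the smallest eigenvalue of a
    symmetric covariance matrix."""
    values, vectors = np.linalg.eigh(np.asarray(covariance, dtype=float))
    return float(values[0]), vectors[:, 0]  # eigh returns eigenvalues in ascending order


def project(samples, direction):
    """One-dimensional projections of the rows of samples onto direction."""
    return np.asarray(samples, dtype=float) @ direction
```

If the returned eigenvalue is far below 1 while the overall mixture is in isotropic position, the projected samples are the input to the interval-based clustering described in Section 5.2.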

The final observation that guarantees that our algorithm will make progress with every iteration, and
thus terminate after a polynomial number of steps, is the following analog of the Random Projection Lemma
for the $k$-Gaussians setting. Given a mixture of $k$ Gaussians in isotropic position, with high probability
over random unit vectors r, there will be some pair of projected Gaussians whose parameters are reasonably 
different. Thus, in every projection, we will, with high probability, see what appears to be a mixture of at 
least two components. 



B Polynomially Robust Identifiability 
B.l Outline 

We now sketch the rough outline of the proof of Theorem 4. While there are considerable technical details,
the main proof ideas are identical to those used in [17] to prove the analogous theorem in the case that
$n = k = 2$.

Our proof will be via induction on $\max(n, k)$. We start by considering the constituent Gaussian of minimal
variance in the mixtures. Assume without loss of generality that this minimum variance component is the
first component of $F$, and denote it by $N_1$. If there is no component of $F'$ whose mean, variance, and mixing
weight very closely match those of $N_1$, then we argue that there is a significant disparity in the low-order
moments of $F$ and $F'$, no matter what the other Gaussian components are. (This argument is rather involved,
and we will give the high-level sketch in the next paragraph.) If there is a component $N_1'$ of $F'$ whose mean,
variance, and mixture weight very closely match those of $N_1$, then we argue that we can remove $N_1$ from
$F$ and $N_1'$ from $F'$ with only negligible effect on the discrepancy in the low-order moments. More formally,
let $H$ be the mixture of $n - 1$ Gaussians obtained by removing $N_1$ from $F$, and rescaling the weights so as
to sum to one, and define $H'$, a mixture of $k - 1$ Gaussians, analogously. Then, assuming that $N_1$ and $N_1'$
are very similar, the disparity in the low-order moments of $H$ and $H'$ is almost the same as the disparity in
low-order moments of $F$ and $F'$. We can then apply the induction hypothesis to the mixtures $H$ and $H'$.

We now return to the problem of showing that if the skinniest Gaussian in $F$ cannot be paired with
a component of $F'$ with similar mean, variance, and weight, then there must be a polynomially-significant
discrepancy in the low-order moments of $F$ and $F'$. This step relies on 'deconvolving' by a Gaussian with an
appropriately chosen variance (this corresponds to running the heat equation in reverse for a suitable amount
of time). We define the operation of deconvolving by a Gaussian of variance $\alpha$ as $\mathcal{F}_\alpha$; applying this operator
to a mixture of Gaussians has a particularly simple effect: subtract $\alpha$ from the variance of each Gaussian in
the mixture (assuming that each constituent Gaussian has variance at least $\alpha$). If $\alpha$ is negative, this is just
convolution.
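
On the level of parameters, the operator is a one-line transformation. The sketch below represents a mixture as a list of (weight, mean, variance) triples; the function name `deconvolve` and this representation are assumptions of the illustration.

```python
from typing import List, Tuple

Component = Tuple[float, float, float]  # (weight, mean, variance)


def deconvolve(mixture: List[Component], alpha: float) -> List[Component]:
    """Apply F_alpha: subtract alpha from every variance (alpha < 0 convolves instead)."""
    if any(var - alpha <= 0 for _, _, var in mixture):
        raise ValueError("alpha must be smaller than every component variance")
    return [(w, mu, var - alpha) for w, mu, var in mixture]
```

Applying `deconvolve` with $\alpha$ and then with $-\alpha$ returns the original parameters, reflecting that convolution with $\mathcal{N}(0, \alpha)$ undoes the deconvolution.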

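Definition 10. Let $F(x) = \sum_{i=1}^{n} w_i \mathcal{N}(\mu_i, \sigma_i^2, x)$ be the probability density function of a mixture of Gaussian
distributions, and for any $\alpha < \min_i \sigma_i^2$, define

$\mathcal{F}_\alpha(F)(x) = \sum_{i=1}^{n} w_i \mathcal{N}(\mu_i, \sigma_i^2 - \alpha, x)$.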

The key step will be to show that if the skinniest Gaussian in either of the two mixtures cannot be paired
with a nearly identical Gaussian in the other mixture, then there is some $\alpha$ for which the resulting mixtures,
after applying the operation $\mathcal{F}_\alpha$, have large statistical distance. Intuitively, this deconvolution operation
allows us to isolate Gaussians in each mixture and then we can reason about the statistical distance between
the two mixtures locally, without worrying about the other Gaussians in the mixture.

Given this statistical distance between the transformed pair of mixtures, we use the fact that there are
relatively few zero-crossings in the difference in probability density functions of two mixtures of Gaussians
(Proposition 19) to show that this statistical distance gives rise to a discrepancy in at least one of the low-order
moments of the pair of transformed distributions. To complete the argument, we then show that applying
this transform to a pair of distributions, while certainly not preserving statistical distance, roughly preserves
the combined disparity between the low-order moments of the pair of distributions. The complete proof can
be found in Appendix B.

B.2 Theorem 4

In this section we give the complete proof of the polynomially robust identifiability of univariate mixtures of 
k Gaussians (Theorem [4]) . For convenience, we restate the theorem and all necessary definitions. We make 
frequent reference to the simple properties of Gaussians and tail bounds provided in Appendix [j] Throughout 
this section we will consider two univariate mixtures of Gaussians: 

$$F(x) = \sum_{i=1}^{n} w_i N(\mu_i, \sigma_i^2, x), \quad \text{and} \quad F'(x) = \sum_{i=1}^{k} w'_i N(\mu'_i, \sigma'^2_i, x).$$

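Definition 6. We will call the pair $F, F'$ $\epsilon$-standard if $\sigma_i^2, \sigma'^2_i \le 1$ and if $\epsilon$ satisfies:
1. $w_i, w'_i \in [\epsilon, 1]$

2. $|\mu_i|, |\mu'_i| \le 1/\epsilon$

3. $|\mu_i - \mu_j| + |\sigma_i^2 - \sigma_j^2| \ge \epsilon$ and $|\mu'_i - \mu'_j| + |\sigma'^2_i - \sigma'^2_j| \ge \epsilon$ for all $i \ne j$

4. $\epsilon \le \min_\pi \sum_i \left( |w_i - w'_{\pi(i)}| + |\mu_i - \mu'_{\pi(i)}| + |\sigma_i^2 - \sigma'^2_{\pi(i)}| \right)$,
where the minimization is taken over all mappings $\pi : \{1, \ldots, n\} \to \{1, \ldots, k\}$.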

Theorem 4. There is a constant $c > 0$ such that, for any $\epsilon$-standard $F, F'$ and any $\epsilon < c$,

$$\max_{i \le 2(n+k-1)} |M_i(F) - M_i(F')| \ge \epsilon^{O(k)}.$$



The following definition of the deconvolution operation will be central to our proof of Theorem 4.

Definition 10. Let $F(x) = \sum_{i=1}^{n} w_i N(\mu_i, \sigma_i^2, x)$ be the probability density function of a mixture of Gaussian
distributions, and for any $\alpha < \min_i \sigma_i^2$, define

$$\mathcal{F}_\alpha(F)(x) = \sum_{i=1}^{n} w_i N(\mu_i, \sigma_i^2 - \alpha, x).$$



The following lemma argues that if the skinniest Gaussian in mixture $F$ cannot be matched with
a sufficiently similar component in the mixture $F'$, then there is some $\alpha$, possibly negative, such that
$\max_x |\mathcal{F}_\alpha(F)(x) - \mathcal{F}_\alpha(F')(x)|$ is significant. Furthermore, every component in the transformed mixtures
has a variance that is not too small.
Lemma 16. Suppose $F, F'$ are $\epsilon$-standard. Suppose without loss of generality that the Gaussian of minimal
variance is $N(\mu_1, \sigma_1^2)$, and there is some $\gamma$ satisfying $\epsilon/4 \ge \gamma > 0$ such that for all $i$, at least one of the
following holds:

• Imi > 7* 

• |a?-af|>7« 

• \wi — w'i\ > 7. 

Then there is some a such that either 

• max,(|J-,(F)(x) - > 
J^a{F') is at least 7^, 

or 

• max,(|J-,(i^)(x) - To,iF'){x)\) > 
TaiF') is at least 7^8 . 



and the minimum variance in any component of J- a {F) or 



and the minimum variance in any component of Fa{F) or 



Proof. We start by considering the case when there is no Gaussian in F' that matches both the mean 
and variance to within 7®. Consider applying F„2_^ia. F„2_^w{F){p.i) > eA/'(0, 7^^, 0) — 0^/2^ - Next, by 
Corollary [60) ^ 

T,2.^^siF'){fi^) < ^ 



7° V27re 



and thus 



T„2_^^s{F){^,,) - T,2^^^siF')ip^) 



> 



Next, consider the case where we have at least one Gaussian component of F' that matches both and 
af to within 7*, but whose weight differs from wi by at least 7. By the definition of e-standard, there can 
be at most one such Gaussian component, say the i*'*. If wi > w[, then F„2_^i{F){iii) — J'^2_^4 (F')(^i) > 
— ^ H where the second term is a bound on the contribution of the other Gaussian components, using 

7V27r €V27re ' ^ 

the fact that F, F' are e-standard and Corollary 60 Since 7 < e/4, this quantity is at least ^-yl/^^ ' 

If wi <w[, then consider applying F„2_^i to the pair of distributions. Using the fact that ^^^^^ — '^^^ 
we have 



T,2_,.{F'M) - T,2,^4F){t,',) > 



> 



> 



> 



1-7^ 



(wi + 7) 



7V27r 



ev27re 



1 



-.{wi -t- 7) - 

2tt e\/27re 



Wl 



ev27re 



1 



1 

27V27r' 



□ 

Claim 17. Let $f(x^*) \ge M$ for some $x^* \in (0, r)$ and suppose that $f(x) \ge 0$ on $(0, r)$ and $f(0) = f(r) = 0$.
Suppose also that $|f'(x)| \le m$ everywhere. Then $\int_0^r f(x)\,dx \ge M^2/m$.

Proof. Consider the continuous function $g(x)$ that is defined to be $0$ for $x \in [0, x^* - M/m] \cup [x^* + M/m, r]$,
and has slope $m$ on the interval $(x^* - M/m, x^*)$, and slope $-m$ on the interval $(x^*, x^* + M/m)$. Clearly
$f(x) \ge g(x)$ for $x \in (0, r)$, and thus

$$\int_0^r f(x)\,dx \ge \int_0^r g(x)\,dx = M^2/m.$$



□ 
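The triangle comparison in this proof can be checked numerically. The sketch below uses an illustrative function $f(x) = x(r - x)$ with $r = 1$ (our choice, not taken from the paper), for which $M = 1/4$ at $x^* = 1/2$ and $|f'(x)| \le m = 1$, and compares its integral with that of the tent function $g$.

```python
def integrate(f, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def tent(x, x_star, M, m):
    """Comparison function g: a triangle of height M with slopes +m and -m."""
    return max(0.0, M - m * abs(x - x_star))

r = 1.0
f = lambda x: x * (r - x)        # illustrative f: zero at 0 and r, nonnegative between
x_star, M, m = 0.5, 0.25, 1.0    # f(x_star) = M and |f'(x)| = |1 - 2x| <= m on [0, 1]

area_f = integrate(f, 0.0, r)
area_g = integrate(lambda x: tent(x, x_star, M, m), 0.0, r)
```

Here the triangle has base $2M/m$ and height $M$, so its area is $M^2/m$, and the integral of $f$ is larger, as the claim requires.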



The above claim together with Lemma 16 yields the following:
Corollary 18. For $\alpha, \gamma$ as defined in Lemma 16,

D{J^^{F){x),:F^{F'){x))>n{j^^). 

Proof. Let f{x) = Fa{F){x),Fa[F')[x), then /(x*) > M for M ~ ^[^j) and for some x* contained in an 
interval / in which f{x) does not change sign. Similarly, because the minimum variance in any co mp onent 
of J^a{F) or Ta{F') is at least 7^^, this implies that f'{x) — 0(::^) = m. So we can apply Claim jl?] using 
m, M and get that Jj f{x) > and this implies the corollary. □ 

We now show that the poly($\epsilon$) statistical distance between $\mathcal{F}_\alpha(F)$ and $\mathcal{F}_\alpha(F')$ gives rise to a poly($\epsilon$)
disparity in one of the first $2(k + n - 1)$ raw moments of the distributions. To accomplish this, we show
that there are at most $2(k + n - 1)$ zero-crossings of the difference in densities, $f = \mathcal{F}_\alpha(F) - \mathcal{F}_\alpha(F')$, using
properties of the evolution of the heat equation, and construct a degree $2(k + n - 1)$ polynomial $p(x)$ that
always has the same sign as $f(x)$, and when integrated against $f(x)$ is at least poly($\epsilon$). We construct this
polynomial so that the coefficients are bounded, and this implies that there is some raw moment $i$ (at most
the degree of the polynomial) for which the difference between the $i$-th raw moment of $\mathcal{F}_\alpha(F)$ and of $\mathcal{F}_\alpha(F')$
is large.

We use the following proposition from [17] that shows that $\mathcal{F}_\alpha(D)(x) - \mathcal{F}_\alpha(D')(x)$ has few zeros.

Proposition 19. [Prop. 1 from [17].] Given $f(x) = \sum_{i=1}^{m} a_i N(\mu_i, \sigma_i^2, x)$, the linear combination of $m$
one-dimensional Gaussian probability density functions, such that $\sigma_i^2 \ne \sigma_j^2$ for $i \ne j$, assuming that not all
the $a_i$'s are zero, the number of solutions to $f(x) = 0$ is at most $2(m - 1)$.
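To illustrate the zero-crossing bound, the sketch below counts sign changes of a linear combination of $m = 3$ Gaussian densities with distinct variances on a fine grid. The coefficients, means and variances are arbitrary values chosen for this example; a grid count can only undercount the true number of zeros, so it should never exceed $2(m - 1)$.

```python
import math

def gaussian_pdf(mu, var, x):
    """Density of N(mu, var) at the point x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def count_sign_changes(terms, lo=-20.0, hi=20.0, steps=40000):
    """Count sign changes of f(x) = sum a_i N(mu_i, var_i, x) on a uniform grid.

    `terms` is a list of (coefficient, mean, variance) triples.
    """
    changes, previous = 0, 0.0
    for j in range(steps + 1):
        x = lo + (hi - lo) * j / steps
        value = sum(a * gaussian_pdf(mu, var, x) for a, mu, var in terms)
        if value == 0.0:
            continue  # skip exact zeros (e.g. underflow) rather than guess a sign
        if previous != 0.0 and (value > 0) != (previous > 0):
            changes += 1
        previous = value
    return changes

# Difference of a two-component mixture and a single wide Gaussian (illustrative values).
terms = [(0.5, -1.0, 0.5), (0.5, 1.0, 0.7), (-1.0, 0.0, 1.5)]
crossings = count_sign_changes(terms)
```

For these particular values the function is negative in both tails and at 0, and positive near -1 and 1, so the bound of four crossings is attained.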

Lemma 20. Suppose that D{F,F') > $7(7^*) and that the minimum variance in any component of F,F' is 
at least 7^^ and also let F, F' be mixture of n and k Gaussians respectively, and the mean of each component 
of F and F' is at most ^. Then there is some moment i £ [2(n + fc — 1)] s.t. \Ep[x^] — Ep'[x'^]\ > ^^(7'^) for 
some constant c = c{n, k) that depends on n,k. 



Proof. Using Proposition 19 there are at most 2(7i + fc— 1) zero crossings of the function f{x) = F{x) — F'{x). 



Consider the interval / = Using Corollary 62 the contribution to of — / is at most 



0(7 ^), and for sufficiently small 7, this is negligible. 

Because D{F, F') > 51(7^*) and the fact that there are at most 2(n + fc — 1) zero crossings of the function 

f{x), there must be some interval J for which f{x) does not change signs and jj \f{x)\dx > If we 

choose = ztXI^. (a; — z^) for all zeros Zi e /. We can then choose signs so thatp(a;) matches f{x) on JUI — 

J'. Then Jj,p{x)\f{x)\dx>\jjp{x)f{x)dx\-j^_j\p{x)f{x)\dx>\jjp{x)f{x)dx\-0{j-'-^^^^^ 

because each coefficient in p{x) is bounded by ^2(nlk-i) ■ Let J" C J be the interval [a — S,b + S] C J = [a,b]. 

Then \Jj„p{x)f{x)dx\ > | and \Jj„f{x)dx\ > \Jjf{x)dx\ - 0{^) because the 

derivative of f{x) is bounded by 0(^18), and f{a) = f{b) = 0. So choosing S = 0(7^^) yields that 
I Jj„ f{x)dx\ > ri(7^*) (where the constant hidden in 0{) depends on n, fc). 

So this implies that \ Jj„ p{x)f{x)dx\ > r2(7'^("+'=~^)) for some constant c (that does not depend on 
n,k. Using the fact that the coefficients of p{x) are bounded by ^2(m+k-i) , this implies that there is some 
i e [2{n + fc — 1)] such that | Jj,, x^f{x)dx\ > ^1(7"^ (n+fc-i)") fQj. gome constant c' that does not depend on 
n, fc. 

Then using the bound of 0(7^*^^("+'^^^)e ^) for E!ji^i[p{x)f{x)], for sufficiently small 7 this implies 
that \Ef[x'] - Ef'[x']\ > f7(7'="'("+'=-i)) □ 

Unfortunately, the transformation $\mathcal{F}_\alpha$ does not preserve the statistical distance between two distributions.
However, we show that it, at least roughly, preserves (up to a polynomial) the disparity in low-order moments
of the distributions.

Lemma 21. [Lemma 6 from [17].] Suppose that each constituent Gaussian in $F$ or $F'$ has variance in the
interval $[\alpha, 1]$. Then

k k 

^ |M, (J-„(F)) - [T^iF')) I < I^^(^) - 

i=l L / J- 
The proof of the above lemma follows easily from the observation that the moments of $F$ and $\mathcal{F}_\alpha(F)$
are related by a simple linear transformation, which can also be viewed as a recurrence relation for Hermite
polynomials.
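The linear relation between moments can be seen concretely in the convolution direction (that is, for negative $\alpha$): if $X \sim F$ and $Z \sim N(0, a)$ are independent, then $E[(X+Z)^j] = \sum_i \binom{j}{i} E[X^i] E[Z^{j-i}]$. The sketch below, which is an illustration under that standard identity rather than the paper's proof, checks it for a small mixture; the helper names and the (weight, mean, variance) representation are assumptions of this example.

```python
from math import comb

def gaussian_moments(mu, var, n):
    """Raw moments 0..n of N(mu, var) via M_j = mu*M_{j-1} + (j-1)*var*M_{j-2}."""
    m = [1.0, mu]
    for j in range(2, n + 1):
        m.append(mu * m[j - 1] + (j - 1) * var * m[j - 2])
    return m[: n + 1]

def mixture_moments(mixture, n):
    """Raw moments 0..n of a mixture given as (weight, mean, variance) triples."""
    per_component = [(w, gaussian_moments(mu, var, n)) for w, mu, var in mixture]
    return [sum(w * m[j] for w, m in per_component) for j in range(n + 1)]

def convolved_moments(moments, a):
    """Moments after convolving with N(0, a): a linear map of the input moments."""
    n = len(moments) - 1
    noise = gaussian_moments(0.0, a, n)
    return [sum(comb(j, i) * moments[i] * noise[j - i] for i in range(j + 1))
            for j in range(n + 1)]

F = [(0.3, -1.0, 0.5), (0.7, 0.5, 0.9)]
a = 0.25
via_linear_map = convolved_moments(mixture_moments(F, 6), a)
direct = mixture_moments([(w, mu, var + a) for w, mu, var in F], 6)
```

Because the map is triangular with ones on the diagonal, it is invertible, and its inverse expresses the moments of the deconvolved mixture in terms of the original ones.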

We now put the pieces together:
Proof of Theorem 4. The base case for our induction is when $n = k = 1$, and follows from the fact that given
parameters $\mu, \mu', \sigma^2, \sigma'^2$, such that $\sigma^2, \sigma'^2 \le 1$, and $|\mu - \mu'| + |\sigma^2 - \sigma'^2| \ge \epsilon$, then one of the first two moments
of $N(\mu, \sigma^2)$ differs from that of $N(\mu', \sigma'^2)$ by at least $\epsilon/2$.

For the induction step, assume that for all pairs of $\epsilon$-standard mixtures of $n$ and $k$ Gaussians, respectively,
one of the first $2(n + k - 1)$ moments differ by at least $f(\epsilon, n + k)$. Consider $\epsilon$-standard mixtures $F, F'$, mixtures
of $n', k'$ Gaussians, respectively, where either $n' = n + 1$, or $k' = k + 1$, and either $n' = n$ or $k' = k$. Assume
without loss of generality that $\sigma_1^2$ is the minimal variance in the mixtures, and that it occurs in mixture $F$.

We first consider the case that there exists a component of $F'$ whose mean, variance, and weight match
$\mu_1, \sigma_1^2, w_1$ to within an additive $x$, where $x$ is chosen so that each of the first $2(n + k - 1)$ moments of any pair of
Gaussians whose parameters are within $x$ of each other differ by at most $f(\epsilon/2, n + k - 1)/2$; specifically, letting
$q(y)$ be the polynomial (dependent on $n, k$) of Lemma 63 bounding the discrepancy in the first $2(n + k - 1)$
moments of Gaussians whose parameters differ by $y$, we set $x$ so that $q(x) = f(\epsilon/2, n + k - 1)/2$. Note that



for fixed n, fc, x will be polynomial in e. Since Lemma 63 requires that af > \fx^ , if this is not the case, we 



convolve the pair of mixtures by $N(0, \epsilon)$, which by Lemma 21 changes the disparity in low-order moments by
a polynomial amount, and proceed with the chosen value of $x$ and the transformed pair of GMMs.

Now, consider the mixtures H,H', obtained from F^F' by removing the two nearly-matching Gaussian 
components, and rescaling the weights so that they still sum to 1. The pair H,H' will now be mixtures of 
fc' — 1 and n' — 1 components, and will still be (e — e^)-standard, and the discrepancy in their first 2(n' + fc' — 1) 
moments is at most /(e/2,n + fc — l)/2 different from the discrepancy in the pair F,F'. By our induction 
hypothesis, there is a discrepancy in one of the first 2{n' + k' — 3) moments of at least /(e/2, n -I- fc — 3) and 
thus the original pair F^ F' will have discrepancy in moments at least half of this, which is still poly{e), for 
any fixed n, fc. 

In the case that there is no component of $F'$ that matches $\mu_1, \sigma_1^2, w_1$ to within the desired accuracy $x$, we
can apply Lemma 16 with $\gamma = x$, and thus by Lemma 20 there exists some $\alpha$ such that in the transformed
mixtures $\mathcal{F}_\alpha(F), \mathcal{F}_\alpha(F')$, there is a $\mathrm{poly}(x) = \mathrm{poly}(\epsilon)$ disparity in the first $2(k + n - 1)$ moments. By Lemma 21
this disparity in the first $2(k + n - 1)$ moments is polynomially related to the disparity in these first moments
of the original pair of mixtures, $F, F'$. □



C The Basic Univariate Algorithm 

In this section we formally state the Basic Univariate Algorithm, and prove its correctness. In particular,
we will prove the following corollary to the polynomially robust identifiability of GMMs (Theorem 4).

Corollary 5. Suppose we are given access to independent samples from a GMM

$$\sum_{i=1}^{k} w_i N(\mu_i, \sigma_i^2, x)$$

with mean 0 and variance in $[1/2, 2]$, where $w_i \ge \epsilon$, and $|\mu_i - \mu_j| + |\sigma_i^2 - \sigma_j^2| \ge \epsilon$. The Basic Univariate
Algorithm, for any fixed $k$, has runtime and sample complexity at most $\mathrm{poly}_k(1/\epsilon, 1/\delta)$ and with probability at least $1 - \delta$ will
output mixture parameters $\hat w_i, \hat\mu_i, \hat\sigma_i^2$, so that there is a permutation $\pi : [k] \to [k]$ and

$$|w_i - \hat w_{\pi(i)}| \le \epsilon, \quad |\mu_i - \hat\mu_{\pi(i)}| \le \epsilon, \quad |\sigma_i^2 - \hat\sigma_{\pi(i)}^2| \le \epsilon \quad \text{for each } i = 1, \ldots, k.$$



Our proof of the above Corollary will consist of three parts. First, we will show that for any $\alpha < \epsilon$ and any $\delta$,
polynomially many samples (in $1/\alpha$ and $1/\delta$, for fixed $k$) suffice to guarantee that with probability at least $1 - \delta$, the
first $4k - 2$ sample moments will all be within $\alpha$ of the corresponding true moments. Next, we show that it
suffices to perform brute-force search over a polynomially-fine mesh of parameters in order to ensure that at
least one point $(\hat w_1, \hat\mu_1, \hat\sigma_1^2, \ldots, \hat w_k, \hat\mu_k, \hat\sigma_k^2)$ in our parameter-mesh will have first $4k - 2$ moments that
are each within $\alpha$ of the true moments. Finally, we will use Theorem 4 to conclude that the recovered
parameter set $(\hat w_1, \hat\mu_1, \hat\sigma_1^2, \ldots, \hat w_k, \hat\mu_k, \hat\sigma_k^2)$ must be close to the true parameter set, because the first $4k - 2$ moments
nearly agree. We now formalize these pieces.
Algorithm 1. Basic Univariate Algorithm

Input: $k$, $\epsilon$, $\alpha < \epsilon$, $\delta$, sample oracle SA($F$), where $F = \sum_{i=1}^{k} w_i N(\mu_i, \sigma_i^2)$ is a mixture
of $k$ Gaussians, where the mixture has mean 0 and variance at most 2, and whose
component weights and pairwise parameter distances are at least $\epsilon$.
Output: $(\hat w_1, \hat\mu_1, \hat\sigma_1^2, \ldots, \hat w_k, \hat\mu_k, \hat\sigma_k^2)$, s.t. with probability at least $1 - \delta$ over the
random samples, satisfies

$$\alpha \ge \min_{\pi} \sum_{i=1}^{k} \left( |w_i - \hat w_{\pi(i)}| + |\mu_i - \hat\mu_{\pi(i)}| + |\sigma_i^2 - \hat\sigma_{\pi(i)}^2| \right).$$

1. Set $\alpha \le \epsilon^{O(k)}$, where the $O(k)$ is $2k$ more than the exponent from Theorem 4.

2. Take a€~^'^S~^ samples from SA{F), and compute the first 4k — 2 sample moments, 
mi, . . . ,m4fc_2. 

3. Let 7 = 0(a'*'''"^), and we will iterate through the entire set of candidate 
parameter vectors of the form F — {wi, fli, di , . . . ,Wk, llk,dk ) satisfying: 

• All the elements are multiples of $\gamma$,

• $\hat w_i \ge \epsilon/2$, and $\sum_i \hat w_i = 1$,

• each pair of components has parameter distance at least $\epsilon/2$,

• $|\hat\mu_i|, \hat\sigma_i^2 \le 2/\epsilon$.

4. Compute the first $4k - 2$ moments of mixture $\hat F$, $\hat m_1, \ldots, \hat m_{4k-2}$.

5. If for all $i \in \{1, \ldots, 4k - 2\}$, $|m_i - \hat m_i| \le \alpha$, then RETURN $\hat F$.



Figure 2: The Univariate Algorithm.
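The brute-force search in steps 3 to 5 can be sketched on a toy scale. The Python below is an illustration only: it uses a very coarse mesh, takes exact target moments in place of sample moments, and the `bound` parameter, the function names and the (weight, mean, variance) representation are assumptions of this example rather than the paper's settings.

```python
import itertools

def gaussian_moments(mu, var, n):
    """Raw moments 0..n of N(mu, var) via M_j = mu*M_{j-1} + (j-1)*var*M_{j-2}."""
    m = [1.0, mu]
    for j in range(2, n + 1):
        m.append(mu * m[j - 1] + (j - 1) * var * m[j - 2])
    return m

def mixture_moments(mixture, n):
    """Raw moments 1..n of a mixture given as (weight, mean, variance) triples."""
    per_component = [(w, gaussian_moments(mu, var, n)) for w, mu, var in mixture]
    return [sum(w * m[j] for w, m in per_component) for j in range(1, n + 1)]

def grid_search(target, k, gamma, eps, alpha, bound):
    """Return the first mesh point whose first 4k-2 moments are within alpha of target."""
    n = 4 * k - 2
    steps = int(round(bound / gamma))
    units = int(round(1.0 / gamma))
    # Candidate (mean, variance) pairs: multiples of gamma, bounded in magnitude.
    comps = [(i * gamma, j * gamma)
             for i in range(-steps, steps + 1) for j in range(1, steps + 1)]
    # Candidate weight vectors: multiples of gamma, each >= eps/2, summing to 1.
    weights = [ws for ws in itertools.product(range(1, units + 1), repeat=k)
               if sum(ws) == units and min(ws) * gamma >= eps / 2]
    for chosen in itertools.combinations(comps, k):
        # Enforce pairwise parameter distance at least eps/2.
        if any(abs(a[0] - b[0]) + abs(a[1] - b[1]) < eps / 2
               for a, b in itertools.combinations(chosen, 2)):
            continue
        for ws in weights:
            cand = [(w * gamma, mu, var) for w, (mu, var) in zip(ws, chosen)]
            if all(abs(a - b) <= alpha
                   for a, b in zip(mixture_moments(cand, n), target)):
                return cand
    return None

true_mixture = [(0.25, -1.5, 0.5), (0.75, 0.5, 1.0)]   # mean 0, variance below 2
target = mixture_moments(true_mixture, 6)              # exact moments stand in for samples
found = grid_search(target, k=2, gamma=0.25, eps=0.5, alpha=1e-9, bound=1.5)
```

The mesh size grows exponentially with $k$, which is why this is only practical here for a tiny example.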



Lemma 22. Let $x_1, x_2, \ldots, x_m$ be independent draws from a univariate GMM $F$ that is in isotropic position,
and each of whose components has weight at least $\epsilon$. With probability $\ge 1 - \delta$,



m ^ — ' 

1=1 



< 



1 



0{e 



-2k\ 



where the hidden constant on the big- Oh depends on k. 
Proof. By Chebyshev's inequality, with probability at most (5, 



1 




We now bound the right hand side. Clearly, ^xi....,xr„ X^I^i ~ Eaj^Fi^c*^]] = 0. Using the fact that the 
variance of a sum of independent random variables is the sum of the variances, 



E 



1=1 



-E-r 



To conclude, we give a very crude upper bound on the j*'' moment of F\ since F is in isotropic position and 
each Gaussian component has weight at least e, the mean and variance of each component has magnitude at 
most 1/e. Thus E2;^i?[x^] can be bounded by (2/e)-' + Tj, where Tj > maxCT2<i/c A/'(/i, cr^, x)da;^ , 

which, by Corollary 62 is at most 0{l/e^), from which the lemma follows. □ 

We now argue that a polynomially-fine mesh suffices to guarantee that there is some parameter set in our
mesh whose first $4k - 2$ moments are all close to the corresponding true moments.

Lemma 23. Given a univariate mixture $F$ of $k$ Gaussians centered at 0 with variance at most 2, each of
whose weights are at least $\epsilon$, such that each pair of components has parameter distance at least $\epsilon$, and a target
accuracy $\alpha < \epsilon$, there exists a $\gamma = \mathrm{poly}(\alpha)$, and set of parameters $(\hat w_1, \hat\mu_1, \hat\sigma_1^2, \ldots, \hat w_k, \hat\mu_k, \hat\sigma_k^2)$ such that each
parameter is a multiple of $\gamma$, each is bounded by $2/\epsilon$, each weight is at least $\epsilon/2$, each pair of components has
parameter distance at least $\epsilon/2$, and the first $4k - 2$ moments of $F$ are within $\alpha$ of the corresponding moments
of $\hat F$, the mixture corresponding to the recovered parameters.

Proof. Consider the parameter set obtained by rounding the true parameter set, excluding the weights, to
the nearest multiple of $\gamma$. For each weight $w_i$, we set $\hat w_i$ to be either the multiple of $\gamma$ just above, or just
below $w_i$, ensuring that $\sum_i \hat w_i = 1$, which can clearly be done. That the rounded mixture has component
weights at least $\epsilon/2$, pairwise parameter distances at least $\epsilon/2$, and values bounded in magnitude by $2/\epsilon$ is
obvious. We now analyze how much the rounding has affected the moments.



From Claim 65 the $i$-th moment of each component is just some polynomial in $\mu, \sigma^2$, which is a polynomial
of degree at most $i$, with coefficients bounded in magnitude by $(i + 2)!$. Thus changing the mean or variance
by at most $\gamma$ will change the $i$-th moment by at most

$$(i + 2)!\, i \left( (2/\epsilon + \gamma)^i - (2/\epsilon)^i \right) \le (i + 2)!\, i\, (2/\epsilon)^i \left( (1 + \gamma\epsilon/2)^i - 1 \right) \le (i + 2)!\, i\, (2/\epsilon)^i (i \gamma \epsilon) = i^2 (i + 2)!\, 2^i \epsilon^{-i+1} \gamma.$$

Thus if we used the true mixing weights, the error in each moment of the entire mixture would be at 
most k times this. To conclude, note that for each mixing weight \wj — Wj\ < 7, and since, as noted 
in the proof of the previous lemma, each moment is at most 0(e~*) (where the hidden constant depends 
on i), thus the rounding of the weight will contribute at most an extra 0(7e~'). Adding these bounds 
together, we get that each of the first 4fc — 2 moments of F can be off from the true ones by at most 
fc(0(7e-4'=+2) + 2(4fc - 2)2(4fc)!e-'*'=+37 = 0{je'^''+'^), where the hidden constant depends on k. Thus 
letting 7 = CfcCk^'^"^, where the constant c/. depends on k suffices to ensure that all moments are within a of 
their true values. □ 

We now piece together the above two lemmas to prove Corollary [5) 
Proof of Corollary ^ Given a desired moment accuracy a < e, by applying a union bound to Lemma |22[ 
0{ae~^'^5~'^) samples suffices to guarantee that with probability at least 1 — S, the first 4fc — 2 sample moments 



are within $\alpha$ from the true moments. Thus with probability at least $1 - \delta$, by Lemma 23 our polynomial
mesh of parameters suffices to recover a set of parameters $(\hat w_1, \hat\mu_1, \hat\sigma_1^2, \ldots, \hat w_k, \hat\mu_k, \hat\sigma_k^2)$ whose weights and
pairwise parameter distances are at least $\epsilon/2$, and whose first $4k - 2$ moments will all be within $2\alpha$
from the sample moments, and hence within $3\alpha$ from the true moments.

To conclude, note that the pair of mixtures $F, \hat F$, after rescaling by at most $(\epsilon/2)^{1/2}$ so as to ensure each
component in the mixture has variance at most 1 (which scales the $k$-th moments by $(\epsilon/2)^{k/2}$), satisfies the
first three conditions of being $\epsilon/2$-standard, and thus, if the first $4k - 2$ moments (after rescaling) agree to
within $(\epsilon/2)^{2k-1} \cdot (\epsilon/2)^{O(k)}$, Theorem 4 guarantees that the recovered parameters must be accurate to within
$\epsilon$ (where the first $O(k)$ in the exponent is from Theorem 4). Thus setting $3\alpha \le (\epsilon/2)^{O(k)} = \mathrm{poly}_k(\epsilon)$ will
ensure that with the desired high probability, the recovered parameters are $\epsilon/2$ accurate. □



D The General Univariate Algorithm 

D.l Composing Subdivisions 

Lemma 24. Suppose that $F$, $G$ and $H$ are GMMs of $k_1 \ge k_2 \ge k_3$ Gaussians respectively. If $(G, \pi_1) \in \mathcal{D}_\epsilon(F)$
and $(H, \pi_2) \in \mathcal{D}_\epsilon(G)$, then $(H, \pi_2 \circ \pi_1) \in \mathcal{D}_{O(k_1)\epsilon}(F)$.

Proof. Note that $\pi_1 : [k_1] \to [k_2]$ and $\pi_2 : [k_2] \to [k_3]$. Consider $\pi_3 : [k_1] \to [k_3] = \pi_2 \circ \pi_1$. This function $\pi_3$
is onto, because both $\pi_1$ and $\pi_2$ are onto.

Also consider any $j \in \pi_3^{-1}(h)$ (for some $h \in [k_3]$). In fact, let $i \in \pi_2^{-1}(h)$ and $j \in \pi_1^{-1}(i)$. Then because
parameter distance is a distance (i.e. satisfies the triangle inequality):

$$D_p(F_j, H_h) \le D_p(F_j, G_i) + D_p(G_i, H_h) \le 2\epsilon$$

because $(G, \pi_1) \in \mathcal{D}_\epsilon(F)$ and $(H, \pi_2) \in \mathcal{D}_\epsilon(G)$ and $\pi_2(i) = h$ and $\pi_1(j) = i$. We write $w_j^F$ for the weight of
the $j$-th component of $F$ to simplify notation, and similarly for $G, H$. Then using this notation:

$$\Big| \sum_{j \in \pi_3^{-1}(h)} w_j^F - w_h^H \Big| \le \sum_{i \in \pi_2^{-1}(h)} \Big| \sum_{j \in \pi_1^{-1}(i)} w_j^F - w_i^G \Big| + \Big| \sum_{i \in \pi_2^{-1}(h)} w_i^G - w_h^H \Big| \le k_2 \epsilon + \epsilon \le (k_2 + 1)\epsilon.$$
□
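The weight part of this composition argument can be checked on a small example. In the sketch below, a subdivision map is stored as a list with `pi[j]` equal to the coarse component that fine component `j` maps to; this encoding, the helper names and the numeric weights are all illustrative assumptions.

```python
def compose(pi1, pi2):
    """Composition pi2 o pi1 of two component maps stored as lists."""
    return [pi2[i] for i in pi1]

def is_onto(pi, size):
    """True if every index in range(size) has a preimage under pi."""
    return set(pi) == set(range(size))

def weight_error(pi, w_fine, w_coarse):
    """Largest gap between a coarse weight and the total fine weight mapped to it."""
    errors = []
    for h in range(len(w_coarse)):
        total = sum(w for j, w in enumerate(w_fine) if pi[j] == h)
        errors.append(abs(total - w_coarse[h]))
    return max(errors)

w_F = [0.2, 0.3, 0.1, 0.4]     # k1 = 4 components
w_G = [0.52, 0.09, 0.39]       # k2 = 3 components
w_H = [0.51, 0.49]             # k3 = 2 components
pi1 = [0, 0, 1, 2]             # F -> G
pi2 = [0, 1, 1]                # G -> H

eps = max(weight_error(pi1, w_F, w_G), weight_error(pi2, w_G, w_H))
pi3 = compose(pi1, pi2)        # F -> H
composed_error = weight_error(pi3, w_F, w_H)
```

In this example the composed weight error stays below $(k_2 + 1)\epsilon$, matching the final inequality of the proof.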



Fact 25.

$$\|N(\mu, \sigma^2) - N(\mu, \sigma^2(1 + \delta))\|_1 \le 10\delta$$
$$\|N(\mu, \sigma^2) - N(\mu + \sigma\delta, \sigma^2)\|_1 \le 10\delta$$

Corollary 26. If $\sigma_i^2 = \Theta(1)$, then

$$D_p(N(\mu_1, \sigma_1^2), N(\mu_2, \sigma_2^2)) = \Theta\big(D(N(\mu_1, \sigma_1^2), N(\mu_2, \sigma_2^2))\big)$$

Claim 27. Convolving two Gaussians $F_1, F_2$ by the same Gaussian $N(\mu, \sigma^2)$ preserves the parameter distance
between $F_1$ and $F_2$. Also, given an estimate $\hat F_1$ which is within $D$ in parameter distance from $N(\mu, \sigma^2) \circ F_1$, by
subtracting $\mu$ from the mean of $\hat F_1$ and $\sigma^2$ from the variance of $\hat F_1$, we obtain an estimate for $F_1$ which is
within $D$ in parameter distance from $F_1$.

Lemma 28. Suppose {F,tt) e 'Dg{F) and that each Gaussian Fi in the mixture F has variance at least ^. 
Then D{F,F) < 0{k')e, where k' is the number of components in the GMM F. 

Proof. Let k be the number of components in F. Then 



D{F,F)<1 E "^^^oh 



2 



And for each i € [k'] 



\wiFi- WjFj\\i<\wi- ii;j|+min( Wj,Wi) max \\Fi - Fj\\i 

— ^ — ^ — ^ j£7r^i(z) 



We can then apply Fact 25 and the assumption that each Gaussian has variance at least ^ (and if e << 1) 
implies that \\F, - Fj\\i = 0{Dp{F,, Fj)) = 0(e) for all j G n-^ii). And so D{F, F) < 0{k')e □ 



D.2 Windows 

Here we define the notion of a Window. Suppose we run the Basic Univariate Algorithm with target
precision $\epsilon$ (and an error parameter $\delta$). Then the Basic Univariate Algorithm uses a number of samples
that is at most some polynomial in $1/\epsilon$ and $1/\delta$. We in fact assume that the number of samples is at most
$C_B (\epsilon\delta)^{-c_B}$ for some universal constants $C_B, c_B > 0$. Then we denote $Q(\epsilon, \delta)$ as $\frac{1}{C_B}(\epsilon\delta)^{c_B}$.

Definition 11. Let $Q(\epsilon, \delta)$ be the inverse of the number of samples needed by the Basic Univariate
Algorithm when given a target precision $\epsilon$ (and an error parameter $\delta$).

We would like to define a Window to be the range of values from $Q(\epsilon, \delta)$ to $\epsilon$ so that if all pairs of Gaussians
either have parameter distance at least $\epsilon$, or statistical distance at most $Q(\epsilon, \delta)$, then we can just run the
Basic Univariate Algorithm and assume that the algorithm behaves as if each pair of Gaussians that
is extremely close is replaced with a single (appropriately chosen) Gaussian. However, we will need some
slack, and so we make the Window wider so that we can take union bounds over many different runs of the
algorithm, and compose different subdivisions.

Definition 12. Let R{e,S) = '^^j^i and let S{€,S) — ^(j^^j^J for some sufficiently large constants Ci,C2- 

Definition 13. Given a target precision $\epsilon$, we define the Window $W(\epsilon)$ at $\epsilon$ as the range of values $[R(\epsilon, \delta), \epsilon]$.

Definition 14. Given a mixture of Gaussians $F$, we will say that a Window $W(\epsilon)$ is good if for all $i \ne j$,
$D_p(F_i, F_j) \notin W(\epsilon)$.

We give a number of claims that will be useful in the case in which we have a good Window $W(\epsilon)$. So
suppose that the Window $W(\epsilon)$ is good.

Claim 29. The set of Gaussians at parameter distance at most $R(\epsilon, \delta)$ is an equivalence class.
Proof. Consider Gaussians $F_1, F_2$ and $F_3$ such that $F_1$ and $F_2$ are at parameter distance at most $R(\epsilon, \delta)$ and
$F_2$ and $F_3$ are also at parameter distance at most $R(\epsilon, \delta)$. Then $D_p(F_1, F_3) \le D_p(F_1, F_2) + D_p(F_2, F_3) \le 2R(\epsilon, \delta) \ll \epsilon$,
and since there is no pair of Gaussians with parameter distance inside the Window $W(\epsilon)$, this implies that
$D_p(F_1, F_3) \le R(\epsilon, \delta)$. □
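The grouping implied by this claim can be sketched as follows. The code assumes that the parameter distance between two Gaussians is the sum of the absolute differences of their means and variances (consistent with how pairwise distances are written earlier in the text), stores each Gaussian as a (mean, variance) pair, and uses placeholder numbers `low` and `high` for the two ends of a Window; none of these values come from the paper.

```python
import itertools

def parameter_distance(a, b):
    """Assumed form of D_p: |mean difference| + |variance difference|."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def window_is_good(components, low, high):
    """A Window [low, high] is good if no pairwise distance falls inside it."""
    return all(not (low <= parameter_distance(a, b) <= high)
               for a, b in itertools.combinations(components, 2))

def equivalence_classes(components, low):
    """Group components that are within distance `low` of each other (union-find)."""
    parent = list(range(len(components)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i, j in itertools.combinations(range(len(components)), 2):
        if parameter_distance(components[i], components[j]) <= low:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(components)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

components = [(0.0, 1.0), (0.001, 1.0), (0.0015, 1.0005), (2.0, 0.5)]
low, high = 0.01, 0.5            # placeholder Window endpoints
good = window_is_good(components, low, high)
classes = equivalence_classes(components, low)
```

When the Window is good, every pair inside a class is within distance `low` directly, not just through a chain, which is the transitivity argued in the proof.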

We will let $\mathcal{E} = \{\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_{k'}\}$ be the equivalence classes of Gaussians at parameter distance at most $R(\epsilon, \delta)$.
We let $\pi_{\mathcal{E}} : [k] \to [k']$ be the mapping function that maps a Gaussian $F_j$ to the corresponding equivalence
class $\mathcal{E}_i$ (i.e. $\pi_{\mathcal{E}}(j) = i$). From these equivalence classes and this mapping function, we can define a natural
$R(\epsilon, \delta)$-correct subdivision of $F$.

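Definition 15. We define the natural $R(\epsilon, \delta)$-correct subdivision $F^{\mathcal{E}}$ as a mixture of $k'$ Gaussians in which
$F_i^{\mathcal{E}}$ is an arbitrarily chosen representative from $\mathcal{E}_i$, and

$$w_i^{\mathcal{E}} = \sum_{j : \pi_{\mathcal{E}}(j) = i} w_j.$$

Clearly, $(F^{\mathcal{E}}, \pi_{\mathcal{E}}) \in \mathcal{D}_{R(\epsilon,\delta)}(F)$, and $F^{\mathcal{E}}$ actually is an $R(\epsilon, \delta)$-correct subdivision.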
Claim 30. Let $(\hat F, \pi) \in \mathcal{D}_{R(\epsilon,\delta)}(F)$, then $F^{\mathcal{E}} \in \mathcal{D}_{O(k)R(\epsilon,\delta)}(\hat F)$.

Proof. Let $k', k''$ be the number of Gaussians in the GMMs $F^{\mathcal{E}}$ and $\hat F$ respectively. Consider any two
Gaussians $F_i, F_j$ that are not mapped to the same equivalence class, i.e. $\pi_{\mathcal{E}}(i) \ne \pi_{\mathcal{E}}(j)$. Since the Window
$W(\epsilon)$ is good, this implies that $D_p(F_i, F_j) > \epsilon$. So in order for $\hat F$ to be an $R(\epsilon, \delta)$-correct subdivision, it must
be the case that $\pi(i) \ne \pi(j)$.

This means that tt as a partition is a refinement of the partition tt^ . Formally, there must be some function 
TTmt : [k"] — 7> [k'] such that tt^ — iTint o tt. Then it follows that {F^,Trj) G 1^o{k)R{e S){F)- Consider any 

And T,he7r-\t) = Eje^ri(») T,he7r-^j) so this implies that 



And similarly for any j S 7rj„t(j) let ft- = tt ^(j), 

Dj>{Ff,F,) < Dp{Ff,Fh) + Dp{Fh,F,) < 2R{e,5) 

where the last line follows because 7T£{h) — i. □ 

Lemma 31. Suppose we are given a mixture of Gaussians F = '^i-^if^i: : ^) in near isotropic 

position, where Wi > e and the Window W{e) is good and suppose further that af > \ . Let{F,TT) G 15^(^,5) (i^). 
Then with probability at least 1 — 26, the output of the BASIC UNIVARIATE ALGORITHM is a GMM N such 
that N eVoik)e{F). 

Proof. Let $\mathcal{E} = \{\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_{k'}\}$ be the equivalence classes of Gaussians at parameter distance at most $R(\epsilon, \delta)$ (see
Claim 29), and let $F^{\mathcal{E}}$ and $\pi_{\mathcal{E}}$ be the natural $R(\epsilon, \delta)$-correct subdivision for $F$ and corresponding mapping
function.

Let $k''$ be the number of components in $\hat F$. Then we can apply Claim 30 and this implies that $F^{\mathcal{E}}$ is an
$O(k)R(\epsilon, \delta)$-correct subdivision for $\hat F$.



Using Lemma 



28 



this implies that D{F, F^) + D{F^ , F) < 0{k'^)R{e, 6). So this implies that given ^ 



samples taken from F when running the Basic Univariate Algorithm, with probability at least 

>l-d 



Q(£,5) 



^ 0{k^)Rie,S) 



we can assume that all samples came from $F^{\mathcal{E}}$ (because there is an approximate coupling between $F$ and $F^{\mathcal{E}}$ that
fails with probability at most $D(F, F^{\mathcal{E}})$, and with probability at least $1 - \delta$ this coupling will never fail, given the
number of samples obtained from $F$).

When the Basic Univariate Algorithm is run on F^ , the constraints of the Basic Univariate 
Algorithm are met because for all i ^ j, Dp{F[ ,Ff ) > e because the Window W{e) is good. So with 
Algorithm 2. General Univariate Algorithm

Input: $\epsilon$, $k$, sample oracle SA($F$), where $F$ is a mixture of at most $k$ Gaussians, is
$\epsilon$-statistically learnable and is in isotropic position.

Output: $\hat F$ which is a mixture of at most $k$ Gaussians, and is an $\epsilon$-correct
subdivision of $F$

1. Set ei = ^^^,62 = 5(61,(5), ...Ei + l = S(ei, 5), ...6^2 + 1 = 5(6^2,5) 

2. Let SA(F') =AA(0, l)oSA(F) 

3. For all i e [k'^] , ^BaSIC UNIVARIATE ALGORITHM(ei, 5, SA(F')) 

4. For all T C [fe^] 

5. if |r| > ^ and the T-sequence of estimates is an 0(fc)ei correct chain 

6. Output o A/'(0, — I) , where i = minjgTj 

7 . end 

8. end 

9. Output FAIL 



Figure 3: General Univariate Algorithm. 



probability at least $1 - \delta$, the Basic Univariate Algorithm (when run on $F^{\mathcal{E}}$) will return an $\epsilon$-correct
subdivision $N$ of $F^{\mathcal{E}}$ (in fact, a stronger guarantee is true because the Basic Univariate Algorithm will
actually return an estimate $N$ that has $k'$ components, which matches the number of components in $F^{\mathcal{E}}$).

Then we can apply Lemma 24 and $N$ must then be an $O(k)\epsilon$-correct subdivision for $\hat F$.



□ 



D.3 Reaching a Consensus 

Definition 16. We call a sequence of GMMs $F^1, F^2, \ldots, F^r$ an $\epsilon$-correct chain if for all $i \in [r - 1]$,
$F^{i+1} \in \mathcal{D}_\epsilon(F^i)$.

Theorem 32. Suppose we are given a mixture of Gaussians $F = \sum_{i=1}^{k} w_i N(\mu_i, \sigma_i^2, x)$ that is in isotropic
position, where $w_i \ge \epsilon$. Then the General Univariate Algorithm will return a GMM $\hat F$ of $k' \le k$ Gaussians
such that $\hat F$ is an $\epsilon$-correct subdivision of $F$.

Proof. Given e, we first define a sequence of parameters where 

ei = cJ^'''^ ^ S'(ei,5), ...Ei+i = S'(ei,(5),...efe2+i = S(tki,S) 

Suppose first that each Gaussian in F has variance at least \. Then in this case, the idea is to run 
the Basic Univariate Algorithm for a number of different precisions, each of which corresponds to a 
particular Window. We will choose parameters so that these Windows are disjoint, and then because a 
Window is bad iff there is some pair of Gaussians with parameter distance contained inside the Window, at 
most (2) < \ Windows can be bad. So this will guarantee that a strict majority of the computations are 
correct. 

To formalize this, given the sequence of parameters $\epsilon_1, \epsilon_2, \ldots, \epsilon_{k^2+1}$ we define a sequence of Windows
$\mathcal{W} = W(\epsilon_1), W(\epsilon_2), \ldots, W(\epsilon_{k^2+1})$.

Claim 33. The sequence of Windows $\mathcal{W}$ is disjoint.

Proof. If we consider the Window $W(\epsilon_i)$, the largest value contained in any Window $W(\epsilon_j)$ for $j > i$ is the
largest value contained in the Window $W(\epsilon_{i+1})$, which is $\epsilon_{i+1}$. Yet $\epsilon_{i+1} = S(\epsilon_i, \delta)$ and the lower bound for
the Window $W(\epsilon_i)$ is $R(\epsilon_i, \delta)$, and $R(\epsilon_i, \delta) \gg S(\epsilon_i, \delta)$. Similarly, the smallest value in $W(\epsilon_j)$ for $j \le i$ is
the smallest value in $W(\epsilon_i)$. So this implies that for any $i$, the set of Windows $W(\epsilon_1), W(\epsilon_2), \ldots, W(\epsilon_i)$ is
separated from the set of Windows $W(\epsilon_{i+1}), W(\epsilon_{i+2}), \ldots, W(\epsilon_{k^2+1})$, and this implies the claim. □
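The counting step behind the majority argument can be illustrated directly: when the Windows are disjoint, each pairwise parameter distance can fall in at most one of them, so at most $\binom{k}{2}$ of the $k^2 + 1$ Windows are bad. The sketch below uses made-up Window endpoints (decreasing powers of ten) and made-up distances; the paper's actual endpoints are given by $R$ and $S$.

```python
def bad_windows(distances, windows):
    """Indices of Windows [lo, hi] that contain at least one pairwise distance."""
    return [idx for idx, (lo, hi) in enumerate(windows)
            if any(lo <= d <= hi for d in distances)]

k = 3
# Illustrative disjoint Windows: [0.1, 1], [0.001, 0.01], [0.00001, 0.0001], ...
windows = [(10.0 ** (-2 * i - 1), 10.0 ** (-2 * i)) for i in range(k * k + 1)]
# One parameter distance per pair of components: k*(k-1)/2 = 3 values.
distances = [0.3, 0.005, 0.00002]

bad = bad_windows(distances, windows)
good_count = len(windows) - len(bad)
```

With three distances, at most three of the ten Windows can be bad, so the good Windows form a strict majority.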
Suppose running the Basic Univariate Algorithm on Window $W(\epsilon_i)$ returns an estimate $\hat F^i$.

Definition 17. Given any subset of indices $T \subset [k^2 + 1]$, let $i_1 > i_2 > \cdots > i_j > \cdots > i_{|T|}$ be the indices in $T$
arranged in decreasing order. We can generate a sequence of estimates $F_T^1, F_T^2, \ldots, F_T^{|T|}$ in which $F_T^j = \hat F^{i_j}$. Also
let $prec(F_T^j) = \epsilon_{i_j}$, which corresponds to the precision of the Window that returned the estimate $F_T^j = \hat F^{i_j}$.
We call this sequence the $T$-sequence of estimates.

Note that this sequence of estimates $F_T^1, \ldots, F_T^{|T|}$ is arranged in order of coarsening precision, i.e.
$prec(F_T^j) \ll prec(F_T^{j+1})$.

Claim 34. $S(prec(F_T^j), \delta) \ge prec(F_T^{j-1})$

Proof. Let $i_1 > i_2 > \cdots > i_j > \cdots > i_{|T|}$ be the indices in $T$ arranged in decreasing order. So $i_{j-1} > i_j$. Then
$S(prec(F_T^j), \delta) = S(\epsilon_{i_j}, \delta) = \epsilon_{i_j + 1}$. And because $i_{j-1} \ge i_j + 1$, it implies that $\epsilon_{i_{j-1}} \le \epsilon_{i_j + 1}$, and this yields
the claim. □

Let $G \subset [k^2 + 1]$ be the set of indices of Windows that are good, i.e. $W(\epsilon_i)$ is good iff $i \in G$. Then let
$F_G^1, F_G^2, \ldots, F_G^{|G|}$ be the $G$-sequence of estimates. Because the sequence of Windows $\mathcal{W}$ is disjoint, and each
pair of Gaussians (and the corresponding parameter distance) can only make a single Window bad, the set
$G$ is a strict majority, i.e. $|G| > |[k^2 + 1] - G|$.

Claim 35. The $G$-sequence of estimates is an $O(k)\epsilon_1$-correct chain, and $F_G^1$ is an $O(k)\epsilon_1$-correct subdivision
for $F$.

Proof. Let $\epsilon'_1 \ll \epsilon'_2 \ll \cdots \ll \epsilon'_{|G|}$ be the sequence of precisions given by $prec(F_G^1), prec(F_G^2), \ldots, prec(F_G^{|G|})$.

Using Lemma 31, $F_G^1 \in \mathcal{D}_{O(k)\epsilon'_1}(F)$. Because $O(k)\epsilon'_1 \le O(k)S(\epsilon'_2, \delta) \le R(\epsilon'_2, \delta)$ (using the above claim)
this implies that $F_G^1 \in \mathcal{D}_{R(\epsilon'_2, \delta)}(F)$ and so we can apply Lemma 31 again and $F_G^2 \in \mathcal{D}_{O(k)\epsilon'_2}(F_G^1)$. Continuing
this argument, for all $i$, $F_G^{i+1} \in \mathcal{D}_{O(k)\epsilon'_{i+1}}(F_G^i)$. Since $O(k)\epsilon'_i \le O(k)\epsilon'_{|G|} \le O(k)\epsilon_1$, the sequence $F_G^1, F_G^2, \ldots, F_G^{|G|}$
is an $O(k)\epsilon_1$-correct chain. □

Given a subset $G' \subset [k^2 + 1]$, we can check if the $G'$-sequence of estimates is an $O(k)\epsilon_1$-correct chain
because this property is only a function of the estimates. Then if we consider all sets in $2^{[k^2+1]}$, we will find
some set $G' \subset [k^2 + 1]$ that is a strict majority (i.e. $|G'| > |[k^2 + 1] - G'|$) and the $G'$-sequence of estimates is
an $O(k)\epsilon_1$-correct chain. Because $G'$ is a strict majority, and a strict majority $G$ of the Windows are good,
$G \cap G' \ne \emptyset$. Suppose that $g \in G \cap G'$, and let $j$ be the value such that $g$ is the $j$-th largest index in $G'$.

Given the $G'$-sequence of estimates, we can take the sequence $S = F, F_{G'}^j, F_{G'}^{j+1}, \ldots, F_{G'}^{|G'|}$. Since the index $g$
corresponds to a good Window ($W(\epsilon_g)$ is good), the computation $F_{G'}^j$ (which corresponds to the estimate $\hat F^g$)
is at least an $O(k)\epsilon_1$-correct subdivision of $F$. So the sequence $S$ is an $O(k)\epsilon_1$-correct chain. So we can apply
Lemma 24 and this implies that $F_{G'}^{|G'|}$ (i.e. the last estimate in the sequence $S$) is a $(Ck)^{k^2+1}\epsilon_1$-correct
subdivision for $F$. Since $(Ck)^{k^2+1}\epsilon_1 \le \epsilon$, this implies that $F_{G'}^{|G'|}$ is an $\epsilon$-correct subdivision for $F$.

However, we have assumed thus far in the proof of this theorem that each Gaussian has variance at least 
|. So given samples from _F, we can add random noise to each sample. We add Gaussian noise of variance 
i and mean 0, and this corresponds to convolving the original distribution F by A/'(0, ^) to obtain a new 
distribution F' . Then this distribution F' has each Gaussian with variance at least ^ and is also in nearly 
isotropic position - because the original mixture F was in isotropic position, and convolving by A/'(0, |) just 
adds i to the variance of the mixture {var{F') = | + var{F)). 

Using the above argument, we can recover an estimate Fq, ' that is an e-correct subdivision for F' . We 

can subtract ^ from the variance of each component in Fq, , and then using Claim 27 this resulting mixture 
N will be an e-correct subdivision for F. 

□ 
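The noise step in the proof above rests on a simple moment fact: convolving a mixture with a zero-mean Gaussian of variance s adds s to every component variance and to the variance of the whole mixture, and leaves the means and weights alone. The following is a minimal univariate sketch of that bookkeeping. The function names and the noise level `s` are illustrative assumptions, not part of the algorithm as stated.

```python
def mixture_variance(weights, means, variances):
    """Variance of a univariate Gaussian mixture (law of total variance)."""
    mean = sum(w * m for w, m in zip(weights, means))
    within = sum(w * v for w, v in zip(weights, variances))
    between = sum(w * (m - mean) ** 2 for w, m in zip(weights, means))
    return within + between


def add_noise(variances, s):
    """Component variances after convolving with N(0, s)."""
    return [v + s for v in variances]


def remove_noise(variances, s):
    """Undo the convolution on estimated component variances."""
    return [v - s for v in variances]
```

After estimating the noised mixture, subtracting the same s from each estimated variance maps the estimate back to the original mixture, which is the last step of the proof.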
E Exponential Dependence on k is Inevitable



We restate the main proposition that we prove in this section:

Proposition 15. There exists a pair $D_1, D_2$ of $1/(4k^2+2)$-standard distributions that are each mixtures of $k^2+1$ Gaussians such that

$$\|D_1 - D_2\|_1 \le 11k\,e^{-k^2/24}.$$

The following lemma will be helpful in the proof of correctness of our construction.

Lemma 36. Let $F_k(x) = c_k \sum_{i=-\infty}^{\infty} \mathcal{N}(0,1/2,i/k)\,\mathcal{N}(i/k,1/2,x)$, where $c_k$ is a constant chosen so as to make $F_k$ a distribution. Then

$$\|F_k(x) - \mathcal{N}(0,1,x)\|_1 \le 10k\,e^{-k^2/24}.$$

Proof. The probability density function $F_k(x)$ can be rewritten as $F_k(x) = (c_k C_{1/k}(x)\,\mathcal{N}(0,1/2,x)) \circ \mathcal{N}(0,1/2,x)$,
where $C_{1/k}(x)$ denotes the infinite comb function, consisting of delta functions spaced a distance $1/k$ apart,
and $\circ$ denotes convolution. Considering the Fourier transform, we see that

$$\mathcal{F}(F_k)(s) = c_k k\,\big(C_k(s) \circ \mathcal{N}(0,2,s)\big)\,\mathcal{N}(0,2,s).$$

It is now easy to see why the lemma should be true, since the transformed comb has delta functions
spaced at a distance $k$ apart, and we're convolving by a Gaussian of variance 2 (essentially yielding
nonoverlapping Gaussians with centers at multiples of $k$), and then multiplying by a Gaussian of variance 2. The
final multiplication will nearly kill off all the Gaussians except the one centered at 0, yielding a Gaussian with
variance 1 centered at the origin, whose inverse transform will yield a Gaussian of variance 1, as claimed.
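The heuristic argument can be sanity-checked numerically. The sketch below reads the construction as $F_k(x) = c_k \sum_i \mathcal{N}(0,1/2,i/k)\,\mathcal{N}(i/k,1/2,x)$ (the comb-times-Gaussian form used in the proof), truncates the sum to centers within a fixed span, and integrates $|F_k - \mathcal{N}(0,1)|$ on a grid. The truncation span, grid range and step are assumptions chosen for illustration only.

```python
import numpy as np


def gauss(mean, var, x):
    """Density of N(mean, var) evaluated at x."""
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def comb_mixture_density(k, x, span=8):
    """Mixture of N(i/k, 1/2) with weights proportional to N(0, 1/2, i/k)."""
    centers = np.arange(-span * k, span * k + 1) / k
    weights = gauss(0.0, 0.5, centers)
    weights = weights / weights.sum()  # normalisation plays the role of c_k
    return (weights[None, :] * gauss(centers[None, :], 0.5, x[:, None])).sum(axis=1)


def l1_distance_to_standard_normal(k):
    """Riemann-sum estimate of the L1 distance between F_k and N(0, 1)."""
    x = np.linspace(-12.0, 12.0, 24001)
    dx = x[1] - x[0]
    return np.abs(comb_mixture_density(k, x) - gauss(0.0, 1.0, x)).sum() * dx
```

Already for small k the computed distance is far below the bound of the lemma, which is consistent with the remark that the final multiplication kills off all but the central Gaussian.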

To make the details rigorous, observe that the total Fourier mass of $\mathcal{F}(F_k)$ that ends up within the
interval $[-k/2, k/2]$ contributed by the delta functions aside from the one at the origin, even before the final
multiplication by $\mathcal{N}(0,2)$, is bounded by the following:

CO 

2cfcfc^ / Af{0,2,x)dx = 2ckky] I _J\f{0,l,x)dx 



^^J(i-l/2)k j^;^ J(i-l/2)fe/V2 



^ />oo 

2cfeA:V / 



i=l 

00 ^ 

< 2ckky -^=-, iT^rrt- 

Additionally, this $L_1$ Fourier mass is an upper bound on the $L_2$ Fourier mass. The total $L_1$ Fourier mass
(which bounds the $L_2$ mass) outside the interval $[-k/2, k/2]$ contributed by the delta functions aside from
the one at the origin is bounded by

$$2c_k \int_{k/2}^{\infty} 2\max_y\big(\mathcal{N}(0,2,y)\big)\,\mathcal{N}(0,2,x)\,dx \;\le\; 4c_k \int_{k/2}^{\infty} \mathcal{N}(0,2,x)\,dx$$



< 4cfc / Af{0,l,x)dx 



Thus we have that 



ll^(Ffe) - CkkM{0, 2)AA(0, 2)||2 = mFk) ~ Ckk-^MiO, 1)||2 < 4e-'^'/« + 4-^e-'='/«. 

From Plancherel's Theorem: Fk, the inverse transform of J^{F), is a distribution, whose L2 distance from a 
single Gaussian (possibly scaled) of variance 1 is at most 8e~''' To translate this L2 distance to Li distance, 
note that the contributions to the Li norm from outside the interval [—k, k] is bounded by 4 J^^ A/'(0, 1, x)dx < 
4— 4=e~'^ Since the magnitude of the derivative of Fk — Cfcfc I— A/"(0, 1), is at most 2 and the value of 
Fk{x) — Cfcfc ^J^ A/"(0, 1, x) is close to at the endpoints of the interval [— fc, k], we have 

max {\Fk{x) - Ckk-^Af{0,l,x)\)] /(4 • 3) < / \Fk{x) - CkkAfiO,!, x)\^dx, 
xe[-k,k] 2v27r / J-k 
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which, combined with the above bounds on the L2 distance, yields max^g[_fc ^.j (|_Ffc(x) — Cfcfc ^^^ A/"(0, 1, a;)|) < 
(72e"'='/*)i/3. Thus we have 

Zy 2iT k\J It: 

The lemma follows from the additional observation that 

\\M{% 1) - Cfefc-i=AA(0, l)||i = min(||c,fc-^AA(0, 1) 
where the minimization is taken to be over all functions that are probability density functions. □ 



Proof of Proposition 15. We will construct a pair of $1/(4k^2+2)$-standard distributions, $D_1, D_2$, that are
mixtures of $k^2+1$ Gaussians, whose statistical distance is inverse exponential in $k$. Let



k 



^ ' i=-k^ 

$$D_2 = \frac{1}{2}\,c'_k \sum_{i=-k^2}^{k^2} \mathcal{N}(0,1/2,i/k)\,\mathcal{N}(i/k,1/2) + \frac{1}{4k^2+2} \sum_{i=-k^2}^{k^2} \mathcal{N}(i/k,1/2),$$

where $c'_k$ is a constant chosen so as to make $c'_k \sum_{i=-k^2}^{k^2} \mathcal{N}(0,1/2,i/k)\,\mathcal{N}(i/k,1/2)$ a distribution. Clearly the
pair of distributions is $1/(4k^2+2)$-standard, since all weights are at least $1/(4k^2+2)$, and the Gaussian
component of $D_1$ centered at 0 cannot be paired with any component of $D_2$ without having a discrepancy
in parameters of at least $1/2k$.

We now argue that Di,D2 are statistically close. Let D2 = c'^.^-^_f,2 J^{0, 1/2, i/k)Af{i/k, 1/2). Note 

that Fk{x)dx < J\f{0, 1/2, x)2inaxyiAf{0, 1/2, y))dx < f^e^'^' < 2e~''\ and thus \\D'2 - Fk\\i < 

— k^ 

8e , and our claim follows from Lemma 
□



F Partition Pursuit 

F.1 Paired Estimates

We first need to ensure that if we consider two directions $r, r^{x,y}$ that are $\epsilon_2$-close, the parameters of a
component in $P_u[F]$ cannot change too much as we vary $u$ from $r$ to $r^{x,y}$.

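Claim 37. Given a mixture of $k$ $n$-dimensional Gaussians $F = \sum_i w_i F_i$ that is in isotropic position and is
$\epsilon$-statistically learnable, for all $i$, $\|\mu_i\|, \|\Sigma_i\|_2 \le \frac{1}{\epsilon}$.

Proof. For all $i, j$, $\|\mu_i - \mu_j\| \le \frac{1}{\epsilon}$, because if we project onto the direction $\frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|}$ the variance of the
mixture $F$ is 1 and is also at least $w_i w_j \|\mu_i - \mu_j\|^2$, and (since $w_i, w_j \ge \epsilon$) this implies that $\|\mu_i - \mu_j\| \le \frac{1}{\epsilon}$. Yet the convex hull
of $\mu_i$ for all $i$ contains the origin and so

$$\|\mu_i - 0\| \le \max_j \|\mu_i - \mu_j\| \le \frac{1}{\epsilon}.$$

Similarly, for any $i \in [k]$, if we choose $u$ corresponding to the direction of the maximum eigenvector of $\Sigma_i$,

$$1 = var(P_u(F)) \ge w_i u^T \Sigma_i u = w_i \|\Sigma_i\|_2$$

and so $\|\Sigma_i\|_2 \le \frac{1}{\epsilon}$. □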
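The bounds of Claim 37 are easy to check on concrete mixtures. The sketch below puts an arbitrary Gaussian mixture into isotropic position (mean zero, identity covariance) by an affine map, so that the norms of the transformed means and covariances can be compared against $1/\epsilon$, taking $\epsilon$ to be the smallest mixing weight. The helper name `to_isotropic` is an assumption made for this illustration.

```python
import numpy as np


def to_isotropic(weights, means, covs):
    """Affinely map a Gaussian mixture to mean zero and identity covariance.

    weights: shape (k,), means: shape (k, n), covs: shape (k, n, n).
    Returns the transformed component means and covariances.
    """
    w = np.asarray(weights, dtype=float)
    mu = np.asarray(means, dtype=float)
    S = np.asarray(covs, dtype=float)
    centered = mu - w @ mu
    # Covariance of the mixture: within-component plus between-component part.
    total = np.einsum('i,ijk->jk', w, S) + np.einsum('i,ij,ik->jk', w, centered, centered)
    evals, evecs = np.linalg.eigh(total)
    T = evecs @ np.diag(evals ** -0.5) @ evecs.T  # symmetric inverse square root
    new_means = centered @ T.T
    new_covs = np.einsum('ab,ibc,dc->iad', T, S, T)
    return new_means, new_covs
```

Since $w_i \Sigma_i$ and $w_i \mu_i \mu_i^T$ are each dominated by the identity after the transformation, both norms stay below $1/w_i \le 1/\epsilon$, in line with the claim.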

Suppose $F$ is an $n$-dimensional GMM that is $\epsilon$-statistically learnable.

Definition 18. Let $F^u, F^v$ be univariate mixtures of Gaussians. Then we call components $F^u_a, F^v_b$ paired
estimates if there is some $\pi_u, \pi_v$ and $i \in [k]$ such that $\pi_u(i) = a$, $\pi_v(i) = b$ and $(F^u, \pi_u) \in \mathcal{D}_{\epsilon_1}(P_u[F])$ and
$(F^v, \pi_v) \in \mathcal{D}_{\epsilon_1}(P_v[F])$.
Algorithm 3. Solve

Input: $n \ge 1$, $\epsilon_2 > 0$, basis $B = (b_1, \ldots, b_n) \in \mathbb{R}^{n \times n}$, means and variances $m^0, v^0$, and
$m^{ij}, v^{ij} \in \mathbb{R}$ for each $i, j \in [n]$.
Output: $\hat{\mu} \in \mathbb{R}^n$, $\hat{\Sigma} \in \mathbb{R}^{n \times n}$.

1 . Let t)* = i , w"^ and « = 4^ , i;"^' . 

2 . For each i < j £ [n] , let 

_ i/n(u — «' — u-' ) + u° ^ v^^ 



{2e2 + ^)2el {2e2 + ^)4:e2 2e2y^ 2e^ 



3. For each i > j £ \n] , let Vij = Vj» . (* So V" G R">^" *) 

4 . Output 



6,, E = B largmin||M-l/ 
Figure 4: Solve.



Claim 38. Let $(F^u, \pi_u) \in \mathcal{D}_{\epsilon_1}(P_u[F])$ and $(F^v, \pi_v) \in \mathcal{D}_{\epsilon_1}(P_v[F])$. Then for every component $F^u_a$ in $F^u$, there
is some component $F^v_b$ such that the components $F^u_a, F^v_b$ are paired estimates.

Proof. This follows because $\pi_u$ is onto. □

Suppose $u, v$ are $\epsilon_2$-close (i.e. $\|u - v\| \le \epsilon_2$), and let $F^u_a$ and $F^v_b$ be paired estimates.
Claim 39. Dp{F^, F^) <2ei + ^. 

Proof TTu,TTy and i G [k] such that 7r„(i) = a, 7r^,(i) = b and (F", 7r„) G {Pu[F]) and (F"", vr^,) G V^^ iPv[F]). 
Then 

DpiF^Jn < ^p(-F'a , + (P„ [F,] , P„ [F,;] ) + I3p(P„ [F,] , < 2ei + Dp(P„[F,], F^F,]) 

And we can write: 

Dp{Pu[F,],P,[F,]) = \nf{u^v)\ + 

Note that

$$|u^T \Sigma_i u - v^T \Sigma_i v| = |(v + u - v)^T \Sigma_i (v + u - v) - v^T \Sigma_i v| \le 2\|\Sigma_i\|_2 \epsilon_2 + \|\Sigma_i\|_2 \epsilon_2^2.$$
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this 



Then this imphes that Dp{Py,[F,], Py[Fi]) < ||u - +2|li;,||2e2 + ||Sj2ei and if we apply Claim 
is at most — , and this implies the claim. □ 
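The variance term in the proof of Claim 39 is a perturbation bound for a quadratic form: moving a unit direction $v$ by at most $\epsilon_2$ changes $v^T \Sigma v$ by at most $2\|\Sigma\|_2 \epsilon_2 + \|\Sigma\|_2 \epsilon_2^2$. A minimal numerical sketch of that bound follows. The function names are illustrative assumptions.

```python
import numpy as np


def projected_variance_gap(sigma, u, v):
    """|u^T Sigma u - v^T Sigma v| for two directions u and v."""
    return abs(u @ sigma @ u - v @ sigma @ v)


def perturbation_bound(sigma, eps2):
    """Upper bound 2*||Sigma||_2*eps2 + ||Sigma||_2*eps2^2, valid when ||v|| = 1 and ||u - v|| <= eps2."""
    norm = np.linalg.norm(sigma, 2)  # spectral norm
    return 2.0 * norm * eps2 + norm * eps2 ** 2
```

Combined with the bound $\|\Sigma_i\|_2 \le 1/\epsilon$ from Claim 37, this is what makes the projected parameters of a component vary only slightly between $\epsilon_2$-close directions.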



F.2 Reconstruction 

Lemma [7[ [TT] Let e2,ei > 0. Suppose \m° - fi ■ r\,\m'3 - ^ • r*^|, |w° - r'^Er|,|u*^ - (r*J)^I]r*J | are all at 
most ei. Then Solve outputs /t G R" and t G R"""" such that ll/t - n\\ < and IIS - Dili. < 

Furthermore, S ^ and E is symmetric. 

We will again need the notion of a Window: 

Definition 19. Given a target additive error $\epsilon$, we call a Window $W = (\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4)$ well-separated if the
following conditions hold:

1. max(^,^) < e 

2. f +ei «e3 

3. f +ei +63 « £4 

Definition 20. We say that a univariate estimate $\hat{F} = \sum_i \hat{w}_i \hat{F}_i$ (strongly) satisfies a Window $(\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4)$
if for all pairs $\hat{F}_i, \hat{F}_j$, the parameter distance is either at most $\epsilon_1$ or at least $\epsilon_4$. We say instead that the
estimate (weakly) satisfies the Window if all pairwise parameter distances are at most $\epsilon_1$ or at least $\epsilon_3$.

Claim 40. Given any univariate estimate $\hat{F}$ that (weakly) satisfies a Window $W = (\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4)$, the relation of
being at parameter distance at most $\epsilon_1$ is an equivalence relation on the components of $\hat{F}$.
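The two notions of satisfying a Window, and the equivalence classes of Claim 40, can be made concrete with a short sketch. It assumes that a univariate component is a (mean, variance) pair and that the parameter distance is the sum of the absolute mean gap and the absolute variance gap. Both the representation and the function names are assumptions made for illustration.

```python
from itertools import combinations


def parameter_distance(a, b):
    """Assumed parameter distance: |mean gap| + |variance gap|."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def satisfies_window(components, eps1, far):
    """True if every pairwise distance is <= eps1 or >= far.

    Passing far = eps4 checks the strong notion, far = eps3 the weak one.
    """
    for a, b in combinations(components, 2):
        d = parameter_distance(a, b)
        if eps1 < d < far:
            return False
    return True


def equivalence_classes(components, eps1):
    """Group components at distance <= eps1 from a class representative.

    Comparing against one representative is only sound when closeness is
    transitive, which is what weak satisfaction of the Window provides.
    """
    classes = []
    for c in components:
        for cls in classes:
            if parameter_distance(c, cls[0]) <= eps1:
                cls.append(c)
                break
        else:
            classes.append([c])
    return classes
```

An estimate can satisfy the Window weakly without satisfying it strongly, namely when some pair of components sits between $\epsilon_3$ and $\epsilon_4$.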

Let $u, v$ be two directions that are $\epsilon_2$-close, i.e. $\|u - v\| \le \epsilon_2$. Suppose that $(F^u, \pi_u) \in \mathcal{D}_{\epsilon_1}(P_u[F])$
and $(F^v, \pi_v) \in \mathcal{D}_{\epsilon_1}(P_v[F])$. Suppose further that $F^u$ and $F^v$ (weakly) satisfy the Window $(\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4)$. Let
$\mathcal{E}^u = \{E^u_1, E^u_2, \ldots, E^u_{k'}\}$ and $\mathcal{E}^v = \{E^v_1, E^v_2, \ldots, E^v_{k''}\}$ be the equivalence classes of components of $F^u, F^v$ respectively
at parameter distance at most $\epsilon_1$.

Lemma 41. Then $k' = k''$, and there is a permutation $\pi_{u,v} : [k'] \to [k'']$ such that $P_u[F_j]$ is mapped to the
equivalence class $E^u_i$ by the mapping $\pi_u$ iff $P_v[F_j]$ is mapped to the equivalence class $E^v_{\pi_{u,v}(i)}$ by the mapping $\pi_v$.

Also we can construct $\pi_{u,v}$ from the estimates $F^u, F^v$.

Proof. To establish this claim, consider two distinct equivalence classes £^,£^, and let F^,,F^, be arbitrary 
representative. For each component F^, in F"", there is some component Pu\Fi\ in F[,[F] that is mapped by 
-Ku to F",. Then let Pu[Fi],Pu[Fj] be mapped to F^,,F^, respectively - i.e. 7r„(j) = a',TTy{j) — b' . Then since 
F" (weakly) satisfies the Window W, we have that Dp{F^,,F^,) > £3. 

Suppose that Py[Fi], Py[Fj] are are mapped to F^,,F^, and these two components are in the same equiv- 
alence class in the mixture F" . Then Dp{F^,,F^,) < t\. Yet F^,,F^i are paired estimates so using Claim 
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Dp{F^,,F^,) < 2£i + and similarly for F^,,F^,. Then Dp{F^,,F^,) < £1 + 4£i + ^ using the triangle 
inequality, but this contradicts the above implication that Dp{F^, , Ff^,) > £3 because £3 >> £1 + ^. 

This implies that every two components in $F^u$ that are in different equivalence classes must each be
paired to two components in $F^v$ that are also in different equivalence classes. The claim is symmetric
w.r.t. $u, v$, so this implies that $F^u, F^v$ have the same number of equivalence classes.

And also consider any component F^. Using Claim 38 there is some component FJJ' so that F^,F^ are 



paired estimates. Then using Claim 39 Dp{F^,F^) < 2ei + Yet for any component F^ that is not in 
the same equivalence class as FjJ', 

DpiP:, F,") > DpiFi;, F,") - DpiP:, Pn > £3 - 2£i - ^ 

where the last line follows because F" (weakly) satisfies the Window W. So we can construct tTu.v given 
just F'^,P^ because for any pair of equivalence classes £",£J, if there is a pair of Gaussians that are paired 
estimates, the parameter distance between any representative from £f to any representative from £'^ must 
be at most 4£i + — . Yet if there is no such pair of Gaussians, one from each equivalence class, that are 
paired estimates, the parameter distance between any representative from £f to any representative from f J 
is at least £3 — 2ti — so we can distinguish these two cases because £3 >> £1 + ^. □ 

Let $W = (\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4)$ be a well-separated Window. Suppose for some root direction $r$, and $\epsilon_2$-close-by
directions $r^{x,y}$ as in the Partition Pursuit Algorithm, we run the General Univariate Algorithm
with precision $\epsilon_1$ and for each run we get an estimate $F^{x,y}$ that (weakly) satisfies the Window $W$. Then
suppose we run Solve given the directions $r, r^{x,y}$ and the estimates $F^{x,y}$.

Claim 42. Solve returns an $n$-dimensional estimate $\hat{F}$ that is an $\epsilon$-correct subdivision of $F$.

Proof. We can apply Lemma |4l] and find a partition of all equivalence classes that arise in any estimate in 
any direction, into sets = {£f , £2 7 •■■'^'12} with the property that for all F^, there is some h such that 
in each direction r^^y, Fi is mapped some equivalence class in Suppose in direction r^^y, Fi is mapped 
to the equivalence class £j\ Then we can take an arbitrary Fj* in this set, and use these parameters as an 
estimate for the projected mean and projected variance of Pr^ ^ [Fi] and these estimates will be 2ti close in 
parameter distance to the actual values. So we can apply Lemma [Tj and the component Ph of the estimate 
F output from SOLVE that has parameter distance at most £ to F^. So for every component F^, there will be 
some estimate Pt output from SOLVE that has parameter distance at most £ to F^. Additionally, for every 
set of equivalence classes £^ , there is some component Fi with the property that in each direction r^^y, Fi 
is mapped some equivalence class in £^ . So the mapping from a component Fi to an estimate Ph that is 
£-close in parameter distance, will be onto. Lastly, given any partition into sets £^ ^£^^ ...£^ , we can choose 
the weight Wh to be the sum of the estimated weights in any equivalence class fj* in the set, and because 
the General Univariate Algorithm returns an £i-correct subdivision, this aggregate weight Wh will be 
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within an additive kei of the actual aggregate weight of the components Fi that are e-close in parameter 
distance to Fh- □ 



F.3 Observed Components 

Definition 21. Given precision $\epsilon_1$ (given to the General Univariate Algorithm), we say that the
number of observed pairs in the estimate $\hat{F}$ returned is the maximum value of $\binom{k''}{2}$ such that there is a subset
of $k''$ components of $\hat{F}$ with the property that every pair is at parameter distance $> \epsilon_1$. And we will say that
the number of observed components is $k''$.

Suppose we are given any well-separated Window $W = (\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4)$, and an estimate $\hat{F}$ that (weakly)
satisfies the Window $W$. Suppose further that the set of equivalence classes $E_1, E_2, \ldots, E_{k'}$ (of components in
$\hat{F}$ at parameter distance at most $\epsilon_1$) has $k'$ elements.

Claim 43. The number of observed components is $k'$.
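Definition 21 can be evaluated directly by brute force when the number of components is small. The sketch below searches for the largest subset of components that are pairwise more than $\epsilon_1$ apart. As before, components are assumed to be (mean, variance) pairs and the parameter distance is assumed to be the sum of the absolute mean and variance gaps. The exhaustive search is exponential in the number of components and is meant only as an illustration.

```python
from itertools import combinations


def parameter_distance(a, b):
    """Assumed parameter distance: |mean gap| + |variance gap|."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def observed_components(components, eps1):
    """Largest k'' such that some k'' components are pairwise > eps1 apart."""
    for size in range(len(components), 0, -1):
        for subset in combinations(components, size):
            if all(parameter_distance(a, b) > eps1
                   for a, b in combinations(subset, 2)):
                return size
    return 0


def observed_pairs(components, eps1):
    """Number of observed pairs, i.e. k'' choose 2."""
    k2 = observed_components(components, eps1)
    return k2 * (k2 - 1) // 2
```

When the estimate weakly satisfies the Window, picking one representative from each equivalence class attains this maximum, which is the content of Claim 43.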

So let $u, v$ be two directions that are $\epsilon_2$-close (i.e. $\|u - v\| \le \epsilon_2$), and let $F^u, F^v$ be the estimates returned
by the General Univariate Algorithm when given target precision $\epsilon_1$, for the directions $u, v$ respectively.
Suppose further that $F^u$ (strongly) satisfies the Window $W$.

Claim 44. Then the estimate $F^v$ will either (weakly) satisfy the Window $W = (\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4)$, or the number
of observed pairs in $F^v$ is strictly more than the number observed in $F^u$.

Proof. Since the estimate $F^u$ (strongly) satisfies the Window $W = (\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4)$, it also (weakly) satisfies this
Window. So we can apply Claim 43 and this implies that there are $k'$ observed components in the estimate
$F^u$ (if there are $k'$ equivalence classes of components in $F^u$ at parameter distance at most $\epsilon_1$).



Let F^,F^ be two arbitrary components in F". We can apply Claim 38 to get two components F^,F^ 
in F" such that F^ and F^ are paired estimates, and similarly F^ and FJ are also paired estimates. 

Suppose F^,F^ are not in the same equivalence class in F". This implies that Dp{F^ , F^) > £4 because 



F" (strongly) satisfies the Window W. Using Claim 39 we get that 



i^p(i^^i^J)>£4-4£i~^ »£3 

SO this implies that the parameter distance Dp{F^ , F^) does not contribute to F" not (weakly) satisfying W. 
So suppose F^,Fl^ are in the same equivalence class in F". Then using Claim 39 we get that 



i^p(F,^F,^)<£l+4£l + ^ «£3 

because D{F^,F^) < £4. 

This implies that the only way that the Window W could be not (weakly) satisfied if there is some pair 
F^, FJ for which the paired estimates of each are in the same equivalence class in F", and yet Dp{F^ , F^) > £4. 
So for each other equivalence class in F" (other than the one that F", F^ are in), we can select a representative 



component FJ*, and for each one we apply Claim 38 and find a corresponding component F^, . If we take this 
set, and F^,F^ this is a set of fc' + 1 components, and using the above argument all pairs of distances are at 
least £3 >> £1, except for the pair Dp{F^ , F^) which is still > £1, so we have k' + 1 observed components in 
F^ if F^ does not (weakly) satisfy the Window W. □ 



F.4 Partition Pursuit 

Theorem 8. Given an $\epsilon$-statistically learnable GMM $F$ in isotropic position, the Partition Pursuit
Algorithm will recover an $\epsilon$-correct sub-division $\hat{F}$, and if $F$ has more than one component, $\hat{F}$ also has more
than one component.

Proof. Given an £-statistically learnable GMM F in n dimensions (and in isotropic position), we can project 
onto a direction r chosen uniformly at random. Using LemmajlSj we can instantiate the Partition Pursuit 
Algorithm with a Window W — (£1, £2, £3, £4) with £4 = poly{e, ^) so that there is at least one pair of 
Gaussians (with high probability) that when projected onto r are at parameter distance at least 64. So when 
Algorithm 4. Partition Pursuit

Input: $\epsilon$, $k$, sample oracle SA($F$), where $F$ is a mixture of at most $k$ Gaussians, is
$\epsilon$-statistically learnable and is in isotropic position.
Output: $\hat{F}$ which is a mixture of at most $k$ Gaussians, is an $\epsilon$-correct subdivision
of $F$, and if $F$ has more than one component, $\hat{F}$ also has more than one component.

1. Set
2. Choose $r$ uniformly at random
3. Let $W = (\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4)$ be a well-separated Window
4. < retry >: Choose a basis $B = (b_1, b_2, \ldots, b_n) \in \mathbb{R}^{n \times n}$ uniformly at random among all
   bases for which $r = \sum_i b_i / \sqrt{n}$
5. $F^r \leftarrow$ General Univariate Algorithm($\epsilon_1$, $r^T$SA($F$), $\delta$, $k$)
6. While $F^r$ does not strongly satisfy $W$
7.     Shift Window $W$: $(\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4) \leftarrow (\epsilon_1', \epsilon_2', \epsilon_3', \epsilon_1)$ where $(\epsilon_1', \epsilon_2', \epsilon_3', \epsilon_1)$
       is a well-separated Window
8.     $F^r \leftarrow$ General Univariate Algorithm($\epsilon_1$, $r^T$SA($F$), $\delta$, $k$)
9. end
10. Set $r^{ij} = r + \epsilon_2 b_i + \epsilon_2 b_j$
11. For $i, j \in [n]$
12.     $F^{ij} \leftarrow$ General Univariate Algorithm($\epsilon_1$, $(r^{ij})^T$SA($F$), $\delta$, $k$)
13.     if $F^{ij}$ does not weakly satisfy $W$
14.         Set $r \leftarrow r^{ij}$
15.         Shift Window $W$: $(\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4) \leftarrow (\epsilon_1', \epsilon_2', \epsilon_3', \epsilon_1)$ where $(\epsilon_1', \epsilon_2', \epsilon_3', \epsilon_1)$
            is a well-separated Window
16.         jump to < retry >
17.     end
18. end
19. $\hat{F} \leftarrow$ Solve($\{F^{ij}\}_{i,j}$, $F^r$, $\epsilon_1$, $\epsilon_2$, $\{r^{ij}\}_{i,j}$, $r$)
20. Output $\hat{F}$



Figure 5: The Partition Pursuit Algorithm. 
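The geometry of the probed directions can be sketched in a few lines. The code below builds an orthonormal basis $b_1, \ldots, b_n$ whose normalized sum equals the root direction $r$, then forms the nearby directions $r^{ij} = r + \epsilon_2 b_i + \epsilon_2 b_j$. Two points are assumptions made for this illustration: the constraint on the basis is taken to be $r = \sum_i b_i / \sqrt{n}$, and the basis is produced by a single Householder reflection, which yields one valid basis rather than a uniformly random one.

```python
import numpy as np


def basis_with_diagonal(r):
    """Orthonormal basis (as columns) with sum_i b_i / sqrt(n) equal to the unit vector r."""
    r = np.asarray(r, dtype=float)
    n = r.size
    diag = np.ones(n) / np.sqrt(n)
    w = diag - r
    if np.linalg.norm(w) < 1e-12:
        return np.eye(n)  # r is already the normalized all-ones direction
    # Householder reflection that swaps diag and r; its columns are orthonormal.
    return np.eye(n) - 2.0 * np.outer(w, w) / (w @ w)


def nearby_directions(r, eps2):
    """All r^{ij} = r + eps2*b_i + eps2*b_j for i, j in [n] (not renormalized)."""
    r = np.asarray(r, dtype=float)
    B = basis_with_diagonal(r)
    n = r.size
    return {(i, j): r + eps2 * B[:, i] + eps2 * B[:, j]
            for i in range(n) for j in range(n)}
```

Each $r^{ij}$ differs from $r$ by at most $2\epsilon_2$ in norm, so by the paired-estimate claims the univariate parameters seen along $r^{ij}$ stay close to those seen along $r$.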
we run the General Univariate Algorithm after projecting onto the direction $r$, the estimate returned
will have at least two components in order for it to be an $\epsilon_1$-correct subdivision for $P_r[F]$.

If the estimate returned by the General Univariate Algorithm does not (strongly) satisfy
the Window $W$, we can perform a shifting operation on the Window $W$ to obtain a new Window $W' =
(\epsilon_1', \epsilon_2', \epsilon_3', \epsilon_1)$ so that $W'$ is also well-separated and the number of pairwise components observed has strictly
increased. So eventually we can find a Window $W' = (\epsilon_1', \epsilon_2', \epsilon_3', \epsilon_4')$ such that the estimate $F^r$ returned by the
General Univariate Algorithm run with target precision $\epsilon_1'$ (strongly) satisfies the Window. Because
the number of observed components strictly increases each time we perform a shifting operation, the number
of times that we must slide the Window is at most $k$. And each time we slide a Window, the parameters of
the new Window are polynomially related to the parameters in the old Window. So the precision $\epsilon_1'$ of this
Window will be some polynomial in the original precision $\epsilon_1$.

So the total number of times that we need to slide the Window is at most fc, and this implies that the 
parameters we need remain polynomially lower-bounded in e, -. And when we need to perform no more 
slides, we have reached a root direction r such that the estimate returned by the General Univariate 
Algorithm is (strongly) consistent with the Window W' , and for each direction r^^ the estimate returned 
by the General Univariate Algorithm (weakly) satisfies the Window W as well. 

Using Claim 42, this implies that the output of our algorithm is an $n$-dimensional $\epsilon$-correct sub-division
$\hat{F}$ for $F$. □



G Clustering and Recursion 
G.1 Bi-Partitions

Suppose the estimate $\hat{F}$ returned by the Partition Pursuit Algorithm is an $\epsilon_1$-correct subdivision for
$F$, but is not a good estimate in terms of statistical distance. The only way that this can happen is if there
is some component of $F$ which has a covariance matrix that has a very small eigenvalue. In this case, we
can use this direction (i.e. the eigenvector corresponding to this eigenvalue) to cluster samples from $F$ into
two sets, and proceed in each set by induction.

In this section, we give some simple claims that will be useful building blocks for deciding how to cluster.
Specifically, we will need to choose some clustering scheme for samples coming from $F$, so that there is some
bi-partition of the components of $F$ into $S \subset [k]$ and $[k] - S$ such that any sample generated from $F_i$ ($i \in S$)
has a negligible probability of being mis-clustered.

Claim 45. Given a set of $k$ points $x_1, x_2, \ldots, x_k \in \mathbb{R}$ on the line such that the maximum distance between any pair
is $\Delta$, there is a bi-partition $A \subset \{x_1, x_2, \ldots, x_k\}$, $B = \{x_1, x_2, \ldots, x_k\} - A$ such that $D(A, B) \ge \frac{\Delta}{2^{k-2}}$ (and
$A, B \neq \emptyset$) and $diam(A), diam(B) \le \Delta(1 - 2^{2-k})$.

Proof. Assume that $x_1$ is at least as small as any other value in the set, and assume that $x_2$ is at least as
large as any other value in the set. Then set $A_2 = \{x_1\}$, $B_2 = \{x_2\}$. Clearly $D(A_2, B_2) \ge \Delta$. Consider
the point $x_3$. Either $D(A_2, x_3)$ or $D(B_2, x_3)$ must be at least $\frac{\Delta}{2}$ using the triangle inequality (because
$D(A_2, B_2) \ge \Delta$). Add the point $x_3$ to the side that it is closest to, and the resulting subsets $A_3, B_3$ are at
distance at least $\frac{\Delta}{2}$. Iterating this procedure yields two subsets $A_k, B_k$ that are disjoint, have $D(A_k, B_k) \ge \frac{\Delta}{2^{k-2}}$
and $A_k \cup B_k = \{x_1, x_2, \ldots, x_k\}$. Also $diam(A_k) = \max_{x_i \in A_k} x_i - x_1 \le x_2 - D(A_k, B_k) - x_1 \le \Delta(1 - 2^{2-k})$,
and similarly for $B_k$. So take $A = A_k$, $B = B_k$, and this implies the claim. □
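The iterative procedure in the proof of Claim 45 is a greedy pass over the points. A minimal sketch follows: the two extreme points seed the two sides, and every other point joins the side it is closer to, which can at most halve the gap between the sides at each step. With $k$ points this leaves a gap of at least $\Delta / 2^{k-2}$. Processing the points in sorted order and the function names are choices made for this illustration.

```python
def greedy_bipartition(points):
    """Split points on the line into a left side A and a right side B.

    The smallest and largest points seed A and B. Each remaining point is
    added to the nearer side, so the gap between the sides shrinks by at
    most a factor of two per added point.
    """
    xs = sorted(points)
    if len(xs) < 2:
        raise ValueError("need at least two points")
    A, B = [xs[0]], [xs[-1]]
    for x in xs[1:-1]:
        # Compare the distance to A's rightmost point and to B's leftmost point.
        if x - max(A) <= min(B) - x:
            A.append(x)
        else:
            B.append(x)
    return A, B


def gap(A, B):
    """Distance between the left side A and the right side B."""
    return min(B) - max(A)
```

Both sides are non-empty by construction, and the diameter of each side is at most the full spread minus the gap.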

Claim 46. Given a set of k points Xi,X2, ■■■Xk G on the line that are strictly positive s.t. the maximum 
ratio of any two points in the set is C > 1. Then there is a bi-partition A C {xi, X2, ■■■Xk}, B — {xi, X2, ...Xk}— 
A such that for all Xi € A, Xj G B, 

Xj 

(and A,B and also for all Xi, Xj (z A, ^ < and also for all Xi, Xj G B , ^ < 2^ . 

Proof. Let $y_1, y_2, \ldots, y_k \in \mathbb{R}$ be the logarithm of each point $x_i$, i.e. $y_i = \log x_i$. Then the maximum distance
between any two points in $y_1, y_2, \ldots, y_k$ is $\max_{i,j} \log x_i - \log x_j = \max_{i,j} \log \frac{x_i}{x_j} = \log C$. So let $\Delta = \log C$ and
apply Claim 45 to the set $y_1, y_2, \ldots, y_k$. Then we get a bipartition $A', B'$ of $y_1, y_2, \ldots, y_k$ and let $A, B$ be the
corresponding bi-partition of $x_1, x_2, \ldots, x_k$, i.e. $x_i \in A$ iff $y_i \in A'$.






Then miny^^A',yjeB' Vi - Vj > 5^ and yi - yj log fj-. So this imphes that 



min — > 22fc-i = C^"-^ > 1 

Xi£A,Xj&B Xj 



Also from Clahii 
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we have that metKy^^y^^A' Vi - Vj > - and jji - = log and 



so 



Xi 



max 



and similarly for B. □ 
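The reduction in the proof of Claim 46 (take logarithms, then split by the gap-halving argument of Claim 45) can be sketched as follows. The code works on strictly positive values and returns a side of small values and a side of large values. With $k$ values of maximum ratio $C$, the halving argument on the logarithms gives a ratio of at least $C^{1/2^{k-2}}$ between any large value and any small value. The names `small` and `large` are illustrative and are not tied to the labels $A$, $B$ of the claim.

```python
import math


def ratio_bipartition(points):
    """Split strictly positive values into (small, large) by greedy gap-halving on logs."""
    xs = sorted(points)
    if len(xs) < 2:
        raise ValueError("need at least two points")
    if xs[0] <= 0:
        raise ValueError("all points must be strictly positive")
    small, large = [xs[0]], [xs[-1]]
    for x in xs[1:-1]:
        y = math.log(x)
        # Nearer side in log scale, i.e. nearer in terms of ratios.
        if y - math.log(max(small)) <= math.log(min(large)) - y:
            small.append(x)
        else:
            large.append(x)
    return small, large


def separation_ratio(small, large):
    """Smallest ratio between a value on the large side and one on the small side."""
    return min(large) / max(small)
```

In the case analysis this kind of split is applied to projected variances, so that one side has variances that are a large multiplicative factor above the other.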

Let $\hat{F}$ be a mixture of $n$-dimensional Gaussians s.t. $\hat{F}$ is an $\epsilon_1$-correct sub-division for $F$. Also we assume
that $F$ is in isotropic position.

Claim 47. Let $F$ be an $\epsilon$-statistically learnable distribution in isotropic position. Let $(\hat{F}, \pi) \in \mathcal{D}_{\epsilon_1}(F)$. Then
for any direction $r$, $var(P_r[\hat{F}]) \ge 1 - k^2 O(\frac{\epsilon_1}{\epsilon})$.



Proof. Let fi ~ J2i Wj^j, fi = Wii^i- We can apply Claim 37 to get that ||^ — < ei + kO{^) = 0{^). 
Also using Claim 37 we obtain ||Si||2 < \ and ||E^(j)||2 < 7 

Consider any symmetric matrix $A$: $(u + v)^T A (u + v) = u^T A u + 2 v^T A u + v^T A v$. And so

$$(u + v)^T A (u + v) \le u^T A u + 2\|v\|\|u\|\|A\|_2 + \|v\|^2 \|A\|_2.$$

And we can apply this equation using A = rr'^ , u = fiTT(i) — fi and v = jii — ji — u and note that ||v4||2 = 1, ||m|| < 
0(i + ^) = 0{\) and ||w|| < O(^). Then this implies that {r^ {^h - ^i)f < {r^{fi^(i) - (i)f + O(fcfi). 
Then if we take A to be the discrete distribution with probability wt , and similarly A to be the discrete 
distribution r^ fli with probability m, var{A) > var{A) - OiP^i). 

Also \r'^{'Si — ETr(i))^l — ll^j ~ ^7r(i)ll^' — £- These facts are enough to be able to apply Fact 57 to get 
that var{Pr[F]) > var{Pr[F]) - 0(P|i) □ 



G.2 How to Cluster 

Definition 22. We will call $A, B \subset \mathbb{R}^n$ a clustering scheme if $A \cap B = \emptyset$.

Definition 23. For $A \subset \mathbb{R}^n$, we will write $P[F_i, A]$ to denote $Pr_{x \leftarrow F_i}[x \in A]$, i.e. the probability that a
randomly chosen sample from $F_i$ is in the set $A$.

Let {F, vr) e ^^^{F). Suppose also that F is a mixture of k' components. 
Lemma joj Suppose that for some direction v, for all i: v^TiiV < £2, for ei < If there is some bi-partition 
S C [fc'] s.t. Vig5.jg[fc']_5|w^/ti — v'^fijl > ^^^^ then there is a clustering scheme {A,B) (based only on F) so 
that for ah ie S,j E TT^'^{i), P[F,,A] > 1 - £3 and for all i^S,j & 7r-i(i), Pr[Fi, B]>1- £3. 

Proof. For each i, consider the interval Li — [v^ fii — ^^,v'^fii + ^^]- Then we will choose A = {x £ 
^^\v'^x e Uigs-^i} and similarly we choose B = {x £ ^"\v'^x G Ui^s^i}- 

We first demonstrate that An i? = 0. Because of how A, B are defined, this condition is equivalent to the 
condition that Ai — Ui^sl-i and Bi — Ui^s^i be disjoint. {Ai, i?i C 3? and Ai n Bi = 0). So consider any two 
intervals Li,Lj for i £ S, j ^ S . Then because i,j are on different sides of the bipartition S, [k'] — S, we get 

that Iv"^ fli — v"^ fLj\ > '^^^ so Li, Lj are in fact disjoint. This implies Ai, Bi are disjoint, and this implies that 
A, B are disjoint. 

Since the standard deviation of Fj in the direction of v is at most \/2e2, points outside /7r(j) are at least 
1/ (2£3) standard deviations from their true mean. Using the fact that, for a one-dimensional Gaussian random 
variable, the probability of being at least s standard deviations from the mean is at most 26^" /{\/2tts) < 
1/s, we get that the probability that x sampled from Fi is outside the range [v'^ fit — ^^,v'^fii + is at 
most £3. And this implies the lemma. □ 
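The Gaussian tail fact used in the proof can be checked against the exact two-sided tail. For a standard normal $Z$ and $s > 0$, $Pr[|Z| \ge s] = \mathrm{erfc}(s/\sqrt{2})$, and the proof bounds this by $2e^{-s^2/2}/(\sqrt{2\pi}\,s)$, which in turn is at most $1/s$. The sketch below evaluates both sides. The function names are illustrative.

```python
import math


def two_sided_tail(s):
    """Exact Pr[|Z| >= s] for a standard normal Z."""
    return math.erfc(s / math.sqrt(2.0))


def tail_upper_bound(s):
    """The bound 2*exp(-s^2/2) / (sqrt(2*pi)*s) used in the proof, for s > 0."""
    return 2.0 * math.exp(-s * s / 2.0) / (math.sqrt(2.0 * math.pi) * s)
```

Taking $s$ to be the half-width of an interval measured in standard deviations turns this into a bound on the probability that a sample lands outside the interval around its component mean.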
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Let ( F, tt ) e 2?cj^(F). Suppose also that F is a mixture of k' components. 



Lemma 10, Suppose that for some direction v and some i G [k'] such that: u E^w < £„, for em » ei- If 
there is some bi-partition S C [k'] s.t. 

mm.^sv'^'SiV ^ 1 
max(maxj^5 u^EjU, em) 

(and et << £3) then there is a clustering scheme A,B such that for all i G S,j E 7T^^{i), P[Fi,A] > 1 — £3 
and for ah i i S.j E 7r^^(i), Pr[F,, B]>1- £3. 

Proof. Let T — [k'\ — S. Let as ~ miuigg u^EiU, ctt = maxjgT v^YijV. So we are given that ^^^^.^-^ ^ — j > ^. 



Let B„ = yJierh where = [w^ /i, - ^- '- - £1, w Vi + ^ + ei] 



Let $\hat{F}_j$ be a component in $\hat{F}$ s.t. $\pi(j) = i \in T$. Then the variance of $F_j$ in the direction $v$ is at most
$\sigma_T + \epsilon_1 \le 2\max(\sigma_T, \epsilon_m)$ where here we have used the condition that $\epsilon_m \gg \epsilon_1$. So any point $x$ outside
the interval $I_{\pi(j)}$ is at least $1/(2\epsilon_3)$ standard deviations from its true mean. Using the fact that, for a
one-dimensional Gaussian random variable, the probability of being at least $s$ standard deviations from the
mean is at most $2e^{-s^2/2}/(\sqrt{2\pi}\,s) \le 1/s$, we get that the probability that $v^T x$ (when $x$ is sampled from $F_j$) is
outside the range $I_{\pi(j)}$ is at most $\epsilon_3$.

We will take as our clustering scheme $B = \{x \in \mathbb{R}^n \mid v^T x \in B_v\}$ and $A = \mathbb{R}^n - B$; then clearly
$A \cap B = \emptyset$. So the above statement implies that $Pr[F_j, B] \ge 1 - \epsilon_3$ for any $i \notin S$, $j \in \pi^{-1}(i)$.

Also, for any Fj with 7r(j) e S, the variance when projected onto v is at least &s' — ^i- So the probability 
that a point v'^x (where x is sampled from Fj) is inside the range B^ is at most the measure of By times the 
maximum density of Py[Fj]. This is at most 

^^ Vmax(aT,£.„) ^ 2,^^ , < 2, « £3 

where the last line follows because as >> £m >> because £c > £1 and the ratio ^ r > — is large, 

and because et « £3. 

So we also have that Pr[Fj, B] < £3 for all i E S,j E TT^^{i). So Pr[Fj, A] > 1 — £3, and this implies the 
lemma. □ 
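The clustering scheme built in this proof is a membership test on the projection $v^T x$: the set $B$ collects the points whose projection falls in a union of intervals around the projected means of the small-variance components, and $A$ is the complement. A minimal sketch follows. The interval half-width is passed in as a plain parameter because it depends on the thresholds chosen in the lemma. The function and parameter names are assumptions made for illustration.

```python
import numpy as np


def make_clustering_scheme(v, centers_b, half_width):
    """Return a function labelling a point 'B' if v.x lies within half_width
    of some centre in centers_b, and 'A' otherwise (A is the complement of B)."""
    v = np.asarray(v, dtype=float)

    def label(x):
        t = float(np.dot(v, np.asarray(x, dtype=float)))
        in_b = any(abs(t - c) <= half_width for c in centers_b)
        return "B" if in_b else "A"

    return label
```

Because $A$ is defined as the complement of $B$, disjointness is automatic, and the only work in the proof is to bound the probability that a sample from each component lands on the wrong side.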



G.3 Making Progress when there is a Small Variance 

Lemma [17] Suppose - fj,i\\ < £1, - E^Hf < ei, and \wi - Wi\ < £1, if either ||S^^||2 < 12^;^ 



or 



|Sr'll2 < then 



Now we can describe the idea behind the hierarchical clustering. Suppose the entire algorithm on fc — 1 
Gaussians requires m samples. Then choose £3 = ^ so that if we take ™ samples in total, then each side in the 
bipartition that results from clustering would get at least m samples and none of the samples obtained from 
the oracle are mis-clustered. Then we can run the k — 1 Gaussian algorithm on each side of the bi-partition 
in order to get a statistically good estimate for the original mixture of k Gaussians. 

Given £3, choose £2 s.t. < ^j^- Also choose em « £2 s.t. {^)^ « £|. Then choose ei << £„. 

Definition 24. We call the set of parameters £1 << em << £2 << £3 good if 

< 2ra£i _|_ < ^2 

2. fc2|^ =0(1) 

5- ^1 < € 



4. 5^=0(2-'=) 
5- (^)* «£i 
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Algorithm 5. Hierarchical Clustering Algorithm 

Input: e, £3, k, sample oracle SA(_F), where f is a mixture of at most k Gaussictns, 
is e-statistically learnable and is in isotropic position. 
Output : 


• 


EITHER: F which is a mixture of at most k Gaussians, is e-close to F and 


if F 


• 


OR: A clustering scheme {A, B) s.t. there is some partition S,T of the 
Gaussians in r {b,i f= v) and for all i € o, ±^rx^Fi[Xi € A\ > 1 — €3 

j e T, Pra^^pA^i e B]>l-e3 


, and 


1 . 


Choose a good set of parameters ei,em,e2,e3 (£3 is already fixed) 




z . 


r i — r^ARTITION r^URSUIT^^ei , oA(^_r J,0, /Cj 




3 . 


If F has only one component 




4. 


Output F 




5. 


end 




6. 


If no component in F has a co-veiriance matrix Sj with Amin(£i) < Cm, 




7. 
8. 


Output F 

Else let Fi = M{jXi,T,i) and v^EiV < 




9. 


If for all h^i, v^TihV < e2 and there is some j =^ i s.t. |v^(Ai~Aj)l " 


.0(1) 


10. 


Find a ^^^-mean separated partition S',T' of components in F 




11. 


Let h = [v'^fi.h - '^,v'^fih + ^] 




12. 


Let A = {x€ ^"\v'^x € UheS'Ih}, B = {x € ^"\v'^x € Uher'} 




13. 


Output {A,B) 




14. 


Else 




15. 


Find a (ejj, em)~variance separated partition S',T' of components in 


F. 


16. 


Let T' be the set of smaller-variance components. 




17. 


Let as',ffT' be the smallest and largest variances in S' ,T' respectively 


18. 


Let ift = [w Hh- ei, -y Hh + h eij 




19. 


Set B = {a; e »"|w'^a; € U^6T'}. ^ = »"-B. 




20. 


Output 




21. 


end 




22. 


end 





Figure 6: The Hierarchical Clustering Algorithm. 
Suppose we choose a set of good parameters $\epsilon_1 \ll \epsilon_m \ll \epsilon_2 \ll \epsilon_3$. Then the Hierarchical Clustering
Algorithm will either return an $\epsilon$-close statistical estimate $\hat{F}$ for $F$ or make progress by returning
a clustering scheme.

Theorem 12. The Hierarchical Clustering Algorithm either returns an $\epsilon$-close statistical estimate
$\hat{F}$ for $F$, or returns a clustering scheme $A, B$ such that there is some bipartition $S \subset [k]$ such that for all
$i \in S$, $j \in \pi^{-1}(i)$: $P[F_i, A] \ge 1 - \epsilon_3$ and for all $i \notin S$, $j \in \pi^{-1}(i)$: $Pr[F_i, B] \ge 1 - \epsilon_3$. And also $S$, $[k] - S$ are
both non-empty.

Proof. We analyze the output of the Hierarchical Clustering Algorithm via a case analysis:
• Case 1: Suppose that no Gaussian $\hat{F}_i$ has any variance (i.e. in any direction) that is at most $\epsilon_m$. Then we can



apply Lemma 
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and because -I- — < e^, and this will imply that the estimate F is statistically close to 



the actual mixture F. 

• Case 2: So suppose there is a Gaussian $F_i$ which has a variance of at most $\epsilon_m$ on some direction $v$.

Then using Claim 47, $var(P_v[F]) \ge 1 - O(k^2 \frac{\epsilon_1}{\epsilon})$. Because the parameters are good, we know that
$k^2 \frac{\epsilon_1}{\epsilon} = O(1)$ and so $var(P_v[F]) = \Omega(1)$. Suppose that for all $F_j$, $D_p(P_v[F_i], P_v[F_j]) = o(1)$. In this case, we
could apply Fact 57 and $w_j v^T \Sigma_j v = o(1)$ and similarly $var(\Lambda)$ (where $\Lambda$ is the discrete distribution on $\mathbb{R}$
which takes value $v^T \mu_j$ with probability $w_j$) will be upper bounded by $\max_j (v^T(\mu_i - \mu_j))^2 = o(1)$. So if for
all $F_j$, $D_p(P_v[F_i], P_v[F_j]) = o(1)$, we would have $var(P_v[F]) = o(1)$ which is not possible, hence there must
be some other Gaussian $F_j$ s.t. $D_p(P_v[F_i], P_v[F_j]) = \Omega(1)$.

• Case 2a: Suppose that each Gaussian $F_h$ has projected variance $v^T\Sigma_h v \le \epsilon_2$, and there is a Gaussian $F_j$ s.t. the difference in projected means $|v^T(\mu_i - \mu_j)| = \Omega(1)$.

In this case, we can apply Claim 45 to get a bipartition $S' \subset [k']$ (let $T' = [k'] - S'$) such that $S', T' \ne \emptyset$ and such that for all $i \in S'$, $j \in T'$, $|v^T(\mu_i - \mu_j)| \ge$ 0(2"'^). Because the parameters are good, we have that ^^^^ = 0(2"*^). Then we can apply Lemma 9 to obtain a clustering so that each successive point sampled from the oracle has probability at most $\epsilon_3$ of being mis-clustered, as desired. And since both $S', T'$ are non-empty, this clustering scheme returned by Lemma 9 has the property that for either side of the clustering scheme, there is some component $F_i$ in the original mixture that is mapped to that side w.h.p.

• Case 2b: Either there is some Gaussian $F_h$ which has projected variance $v^T\Sigma_h v > \epsilon_2$, or for all Gaussians $F_j$ ($j \ne i$) the difference in projected means $|v^T(\mu_i - \mu_j)| = o(1)$.

Either case implies that there is some Gaussian $F_h$ such that when projected onto $v$, $F_h$ has variance at least $\epsilon_2$. In the first case, this is directly true. In the second case, (if we let $A$ be the discrete distribution on $\mathbb{R}$ which takes value $v^T\mu_j$ with probability $w_j$), $\mathrm{var}(A) = o(1)$. And using Claim 47 and Fact 57 then there must be some component $F_h$ with $v^T\Sigma_h v \ge \Omega(1) \gg \epsilon_2$.

So let $F_h$ be the component for which $v^T\Sigma_h v$ is the largest (and is at least $\epsilon_2$).



We can do the following: Let $A_1 = \{i \in [k'] \mid v^T\Sigma_i v \le \epsilon_m\} \subset [k']$. Let $B_1 = [k'] - A_1$, which is necessarily non-empty because $h \in B_1$. Then take $B_2 = \{\epsilon_m\} \cup \{v^T\Sigma_i v \mid i \in B_1\}$ and we can apply Claim 46 to get a bi-partition $A_3, B_3$ of $B_2$ with the property that $\epsilon_m \in A_3$, both $A_3, B_3$ are non-empty and (choosing C =

in Claim 
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and » i) the ratfo > i » k 



Then every index $i \in [k']$ either lies in $A_1$ or has projected variance $v^T\Sigma_i v$ in $A_3 \cup B_3$. So we can take $A$ to be the set of indices $i \in [k']$ such that $i \in A_1$ or $v^T\Sigma_i v \in A_3$, and similarly we take $B$ to be the set of indices $i \in [k']$ such that $v^T\Sigma_i v \in B_3$. Then $A, B$ is a bipartition of $[k']$.

Also min.gBt, T.,v ^ """'[^A > ^ » \. And then we can apply Lemma and this yields a clustering so that each successive point sampled from the oracle has probability at most $\epsilon_3$ of being mis-clustered, as desired. Note that $i \in A$ and $h \in B$, so both sides of this clustering scheme are non-empty (for either side of the clustering scheme, there is some component $F_i$ in the original mixture that is mapped to that side w.h.p.).

□ 

This completes the description of the Hierarchical Clustering Algorithm. 
Algorithm 6. High Dimensional Isotropic Algorithm

Input: $k$, $\epsilon$, sample oracle SA($F$), which is a mixture of at most $k$ Gaussians which are $\epsilon$-statistically learnable and in isotropic position
Output: An estimate $\hat F$ that is $\epsilon$-close to $F$

1. Let efe_i = /fa(f,5, fc- 1), £3 = f Cfe-i^
2. OUT ← Hierarchical Clustering Algorithm($\epsilon$, $\delta$, $\epsilon_3$, $k$)
3. If OUT is an estimate $\hat F$
4. Output $\hat F$
5. Else OUT is a clustering scheme $A, B$
6. Take $m$ total samples $x_1, x_2, \ldots, x_m$ from SA($F$)
7. Let $X_S, X_T$ be the samples from $x_1, x_2, \ldots, x_m$ that are in $A, B$ respectively
8. $\hat F_A$ ← High Dimensional Anisotropic Algorithm(|, $\delta$, $k - 1$, $X_S$)
9. $\hat F_B$ ← High Dimensional Anisotropic Algorithm(|, $\delta$, $k - 1$, $X_T$)
10. Output f=^-^Fa+^-^Fb
11. end





Figure 7: The High Dimensional Isotropic Algorithm. 
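
The recursive branch of Figure 7 (steps 6 to 10) can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure. The coefficients in step 10 are illegible in this copy, so the sketch assumes each side is weighted by its fraction of the samples, $|X_S|/m$ and $|X_T|/m$. The learner `learn_side` is a hypothetical stand-in that fits a single one-dimensional Gaussian, where the paper recurses into the High Dimensional Anisotropic Algorithm with $k - 1$.

```python
from typing import Callable, List, Tuple

Component = Tuple[float, float]          # (mean, variance), 1-D for brevity
Mixture = List[Tuple[float, Component]]  # list of (weight, component)

def learn_side(xs: List[float]) -> Mixture:
    """Hypothetical stand-in learner: one Gaussian with weight 1."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(1.0, (mean, var))]

def recombine(samples: List[float],
              in_A: Callable[[float], bool],
              learner: Callable[[List[float]], Mixture]) -> Mixture:
    """Split samples by the clustering scheme, learn each side, then
    rescale each side's weights by its sample fraction (an assumption)."""
    X_S = [x for x in samples if in_A(x)]
    X_T = [x for x in samples if not in_A(x)]
    m = len(samples)
    F_A, F_B = learner(X_S), learner(X_T)
    return ([(len(X_S) / m * w, c) for w, c in F_A] +
            [(len(X_T) / m * w, c) for w, c in F_B])

# Illustrative data only.
samples = [-2.0, 1.0, 2.0, 3.0]
mixture = recombine(samples, lambda x: x < 0.0, learn_side)
```

With this choice the output weights sum to one whenever each side's estimate is itself a valid mixture.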



G.4 Recursion 

H The Isotropic Projection Lemma for k Gaussians 

Lemma [Isotropic Projection Lemma]. Given a mixture of $k$ $n$-dimensional Gaussians $F = \sum_i w_i F_i$ that is in isotropic position and is $\epsilon$-statistically learnable, with probability $\ge 1 - \delta$ over a randomly chosen direction $u$, there is some pair of Gaussians $F_i, F_j$ s.t. $D_p(P_u[F_i], P_u[F_j]) \ge$ |q^.

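Proof. Let £1 = j^j^, and £3 = ^
Let $t = 2\epsilon_1\sqrt{n}/\delta$.

Case 1: $\|\mu_i - \mu_j\| \ge t$ for some $i, j \in [k]$. In this case, by Lemma 50, with probability $\ge 1 - \delta$, $|u \cdot (\mu_i - \mu_j)| \ge \delta t/\sqrt{n} = 2\epsilon_1$, as desired.

Case 2: $\|\mu_i - \mu_j\| < t$ for all $i, j \in [k]$. By Lemma 48, with probability $\ge 1 - \delta$, for some $h$,

$$u^T\Sigma_h u \le 1 - \frac{\epsilon\delta^2(\epsilon^3 - t^2)}{12n^2} \le 1 - \epsilon_2. \qquad (1)$$

If $|u \cdot (\mu_i - \mu_j)| \ge 2\epsilon_1$ for some $i, j$, then we are done. If not, then $|u \cdot (\mu_i - \mu_j)| < 2\epsilon_1$ for all $i, j \in [k]$. Then using Fact 57, $\mathrm{var}(A) + \sum_j w_j u^T\Sigma_j u = 1$ where $A$ is the discrete distribution on points in 1-dimension which is $u \cdot \mu_j$ with probability $w_j$. The variance of this mixture $A$ is upper bounded by $\max_{i,j} |u \cdot (\mu_i - \mu_j)|^2$ which is at most $4\epsilon_1^2 \le 2\epsilon_1$.

So this implies $\sum_j w_j u^T\Sigma_j u \ge 1 - 2\epsilon_1$. Then we get that

$$\sum_{j \ne h} w_j u^T\Sigma_j u \ge 1 - 2\epsilon_1 - w_h(1 - \epsilon_2) = \sum_{j \ne h} w_j + w_h\epsilon_2 - 2\epsilon_1$$

and $w_h\epsilon_2 - 2\epsilon_1 \ge 2\epsilon_1 \ge 2\left(\sum_{j \ne h} w_j\right)\epsilon_1$. So, finally, we obtain $\sum_{j \ne h} w_j u^T\Sigma_j u \ge \left(\sum_{j \ne h} w_j\right)(1 + 2\epsilon_1)$. So there is some $j \ne h$ s.t. $u^T\Sigma_j u \ge 1 + 2\epsilon_1$, and for this $j$, $D_p(P_u[F_j], P_u[F_h]) \ge 2\epsilon_1$ and this yields the lemma. □

Lemma 48. Let $\epsilon, \delta > 0$, $t \in (0, \epsilon^2)$. Let $F$ be an $\epsilon$-statistically learnable distribution in isotropic position. Suppose for all $i, j \in [k]$ that $\|\mu_i - \mu_j\| \le t$. Then, for uniformly random $r$,

$$\Pr_{r \in S_{n-1}}\left[\min_i\, r^T\Sigma_i r > 1 - \frac{\epsilon\delta^2(\epsilon^3 - t^2)}{12n^2}\right] \le \delta.$$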



Proof. We can apply Lemma 52 and then apply Lemma 51. So with probability at least $1 - \delta$, there is some $i$ s.t. $u^T\Sigma_i u \notin [1 - c, 1 + c]$ for $c = \frac{\delta^2 a}{4n}$, $a = \frac{\epsilon^3 - t^2}{3n}$. If $u^T\Sigma_i u < 1 - c$ then we are done. If instead $u^T\Sigma_i u > 1 + c$,
Algorithm 7. High Dimensional Anisotropic Algorithm

Input: $k$, $\epsilon$, sample oracle SA($F$), which is a mixture of at most $k$ Gaussians which are $\epsilon$-statistically learnable

Output: An estimate $\hat F$ that is $\epsilon$-close to $F$

1. Let Efe = SHa{^,S,k) 

4 Y k 

2. Take $m$ = 0{ t^) samples $x_1, x_2, \ldots, x_m$ from SA($F$)

3. Compute the transformation $T$ that places $x_1, x_2, \ldots, x_m$ in exactly isotropic position

4. High Dimensional Isotropic Algorithm fc,f'(SA(F))) 

5. Output F 



Figure 8: The High Dimensional Anisotropic Algorithm. 



we can apply Fact 57 which implies that $\sum_j w_j u^T\Sigma_j u \le 1$ and we have that $w_i u^T\Sigma_i u \ge w_i(1 + c)$. We can apply Claim 49 which implies that there is then some $j \ne i$ s.t. $u^T\Sigma_j u \le 1 - \epsilon c$ which implies the lemma. □

Claim 49. Suppose $w_1(1 + \alpha) + w_2(1 - \beta) \le 1$, $w_1, w_2 \ge \epsilon > 0$, $w_1 + w_2 = 1$ and $\alpha > 0$. Then, $\beta \ge \epsilon\alpha$.
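
Claim 49 is stated without proof. A short derivation, writing $\alpha$ and $\beta$ for the upward and downward deviations in the claim, could run as follows. It uses only the hypotheses of the claim.

```latex
\begin{align*}
w_1(1+\alpha) + w_2(1-\beta) \le 1
  &\iff (w_1 + w_2) + w_1\alpha - w_2\beta \le 1 \\
  &\iff w_1\alpha \le w_2\beta
     && \text{since } w_1 + w_2 = 1, \\
\beta \;\ge\; \frac{w_1}{w_2}\,\alpha \;\ge\; \epsilon\,\alpha
  &&& \text{since } w_1 \ge \epsilon \text{ and } 0 < w_2 \le 1 .
\end{align*}
```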
Lemma 50. For any $\mu_i, \mu_j \in \mathbb{R}^n$, $\delta > 0$, over uniformly random unit vectors $u$,

$$\Pr_{u \in S_{n-1}}\left[\,|u \cdot \mu_i - u \cdot \mu_j| \le \delta\|\mu_i - \mu_j\|/\sqrt{n}\,\right] \le \delta.$$
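
Lemma 50 lends itself to a quick Monte Carlo sanity check. The sketch below draws uniformly random unit vectors by normalizing standard Gaussian vectors and estimates how often the projected gap $|u \cdot (\mu_i - \mu_j)|$ is at most $\delta\|\mu_i - \mu_j\|/\sqrt{n}$. The dimension, $\delta$, trial count and seed are arbitrary choices for illustration, and the check is empirical evidence, not a proof.

```python
import math
import random

def random_unit_vector(n: int, rng: random.Random) -> list:
    """Uniform direction on the sphere: normalize a standard Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def small_projection_rate(mu_i, mu_j, delta: float,
                          trials: int, rng: random.Random) -> float:
    """Fraction of random directions u with
    |u.(mu_i - mu_j)| <= delta * ||mu_i - mu_j|| / sqrt(n)."""
    diff = [a - b for a, b in zip(mu_i, mu_j)]
    n = len(diff)
    threshold = delta * math.sqrt(sum(d * d for d in diff)) / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        u = random_unit_vector(n, rng)
        if abs(sum(ui * di for ui, di in zip(u, diff))) <= threshold:
            hits += 1
    return hits / trials

rng = random.Random(0)  # fixed seed so the run is repeatable
delta = 0.2
rate = small_projection_rate([1.0] * 10, [0.0] * 10, delta, 20000, rng)
```

The lemma predicts that `rate` stays below `delta`, up to sampling noise.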

Lemma 51. [17] Suppose $\|\Sigma_h^{-1}\|_2 \ge 1 + a$, then $\Pr_{u \in S_{n-1}}\left[\, u^T\Sigma_h u \in [1 - c, 1 + c] \,\right] \le \delta$, $c = \frac{\delta^2 a}{4n}$.
Lemma 52. Suppose the mixture $F = \sum_i w_i F_i$ is in isotropic position and is $\epsilon$-statistically learnable, and that for all $i, j \in [k]$, $\|\mu_i - \mu_j\| \le t$. Then $\max_i\{\, \|\Sigma_i^{-1}\|_2 \,\} \ge 1 + a$, for $a = \frac{\epsilon^3 - t^2}{3n}$.

Proof. By Fact [58j the squared variation distance between Fi and Fj is, 

1 " 1 

< {D{F,,F,)f < - ^(A, + _ - 2) + (/ii - M2)^Sr'(Mi - ^2). 



Where Ai, . . . , A„ > are the eigenvalues of E^^Ej. Suppose (^1 — /i2)^2]~^(^i — /12) > V' ^^^^ t'^i^ implies 
||5]~^||2 ^ 7 because \\fii ~ H2\\ < ^7 and we would be done in this case. If not, then from the above equation 
we get 



' A, ' e 
1—1 



In particular, there must be some eigenvalue A, such that, A + l/A — 2>^ (£^~V)~e?' f be a unit 

Then we have that v^Ynv — Xv^Yiji 

'v'^'SiV ,\ /v'^Yi-jV \ , 1 „ 6a 



(eigen)vector corresponding to A, i.e., v = AE^ ^EjW. Then we have that w^E^w = Af-^EjW and this yields 



(^-l) + (^-l)^A+i-2> 
ijjV / l^iV / A 



Since one of the two terms in parentheses above must be at least $3a/\epsilon^2$, WLOG, we can take $\frac{v^T\Sigma_i v}{v^T\Sigma_j v} \ge 1 + 3a/\epsilon^2$.

This means that the numerator or denominator is bounded from 1. We can break this into two cases.
Case 1: $v^T\Sigma_j v \le 1/(1 + a)$. This establishes the lemma immediately.

Case 2: $v^T\Sigma_i v \ge (1 + 3a/\epsilon^2)/(1 + a) = 1 + (3/\epsilon^2 - 1)a/(1 + a) \ge 1 + (3/\epsilon^2 - 1)a/2$. By Claim 49, since $\sum_h w_h v^T\Sigma_h v \le 1$, we have there is some $g \in [k]$, $g \ne i$ such that

$$v^T\Sigma_g v \le 1 - \frac{\epsilon}{2}\left(\frac{3}{\epsilon^2} - 1\right)a \le 1 - a.$$

This means that $\|\Sigma_g^{-1}\|_2 \ge 1/(1 - a) \ge 1 + a$.

□ 






I Approximate Isotropic Position 



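Theorem 53. [17] Let $F_1 = N(\mu_1, \Sigma_1)$. Let m = 0{ ^ ). Then given $m$ samples from $F_1$, $x_1, x_2, \ldots, x_m$, compute $\hat\mu_1 = \frac{1}{m}\sum_i x_i$ and $\hat\Sigma_1 = \frac{1}{m}\sum_i x_i x_i^T - \hat\mu_1\hat\mu_1^T$. Let $\hat F_1 = N(\hat\mu_1, \hat\Sigma_1)$. Then with probability at least $1 - \delta$, $D(F_1, \hat F_1) \le$ 0{e-).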

Then, suppose we are given access to an $\epsilon$-statistically learnable distribution $F$ on $k$ components, which is not necessarily in isotropic position. Suppose additionally that our sample oracle gives us the labeling (corresponding to which component each sample came from) and we are given m = 0{ "4'"/ ) samples and labels $(x_1, \ell_1), (x_2, \ell_2), \ldots, (x_m, \ell_m)$, where each $\ell_i \in [k]$.

Then suppose, from these samples, we construct an empirical distribution $\hat F$. Consider each component $\hat F_i$. We take $\hat\mu_i = \frac{1}{|\{j \mid \ell_j = i\}|}\sum_{j \text{ s.t. } \ell_j = i} x_j$ and we similarly take $\hat\Sigma_i = \frac{1}{|\{j \mid \ell_j = i\}|}\sum_{j \text{ s.t. } \ell_j = i} x_j x_j^T - \hat\mu_i\hat\mu_i^T$. And further, take $\hat w_i = \frac{|\{j \mid \ell_j = i\}|}{m}$.

Corollary 54. Form = 0( "4c+/ ), L){F, F),maxi D{Fi, Fi),ma,Xi \wi — Wi\ < 0{e'^), with probability at least 
1-S. 

Proof. First, consider any $i$ and let $m_i = |\{j \mid \ell_j = i\}|$. Then we can apply Hoeffding's bound and

Pr[|"^-u;,|>^]<2e-2'"*< A 
m 4fc 4k 



-"In ^ 



because m > ri(-^^|j^). 

So each i receives at least Q{em) — rj( " ^4" ^ ) samples, so using Theorem 53 D{Fi,Fi) < 0{e'^) with 
probability at least 1 — ^ . 

Then D{F, F) < max^ D{Fi, Fi)+J2i = 0(e'^) and the total probability of any bad event occurring 

is at most 5 so this implies the corollary. □ 

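Claim 55. $E_{x \leftarrow \hat F}[x] = \frac{1}{m}\sum_j x_j$ and $E_{x \leftarrow \hat F}[xx^T] = \frac{1}{m}\sum_j x_j x_j^T$.

Proof. Write $m_i = |\{j \mid \ell_j = i\}|$, so that $\hat w_i = m_i/m$. Then

$$E_{x \leftarrow \hat F}[x] = \sum_i \hat w_i\hat\mu_i = \sum_i \frac{m_i}{m}\cdot\frac{1}{m_i}\sum_{j \text{ s.t. } \ell_j = i} x_j = \frac{1}{m}\sum_i\sum_{j \text{ s.t. } \ell_j = i} x_j = \frac{1}{m}\sum_j x_j.$$

And also

$$E_{x \leftarrow \hat F}[xx^T] = \sum_i \hat w_i\left(\hat\Sigma_i + \hat\mu_i\hat\mu_i^T\right) = \sum_i \frac{m_i}{m}\cdot\frac{1}{m_i}\sum_{j \text{ s.t. } \ell_j = i} x_j x_j^T = \frac{1}{m}\sum_j x_j x_j^T.$$

□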
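
The labeled estimators and the identity in Claim 55 can be checked numerically. The sketch below is a one-dimensional analogue (scalars in place of vectors, variances in place of covariance matrices). It assumes the weight estimate is the fraction of samples carrying each label. The function name `labeled_estimates` and the data are illustrative.

```python
from collections import defaultdict

def labeled_estimates(samples, labels):
    """Per-label (weight, mean, variance) using the plug-in estimators:
    weight = fraction of samples with that label,
    mean = average, variance = average of squares minus squared mean."""
    groups = defaultdict(list)
    for x, label in zip(samples, labels):
        groups[label].append(x)
    m = len(samples)
    est = {}
    for label, xs in groups.items():
        mean = sum(xs) / len(xs)
        var = sum(x * x for x in xs) / len(xs) - mean * mean
        est[label] = (len(xs) / m, mean, var)
    return est

samples = [1.0, 3.0, 10.0, 12.0, 14.0]
labels = [0, 0, 1, 1, 1]
est = labeled_estimates(samples, labels)

# Moments of the empirical mixture, computed from the labeled estimates.
mix_first = sum(w * mu for w, mu, _ in est.values())
mix_second = sum(w * (var + mu * mu) for w, mu, var in est.values())

# The same moments computed from the pooled samples, ignoring labels.
pooled_first = sum(samples) / len(samples)
pooled_second = sum(x * x for x in samples) / len(samples)
```

The two pairs of moments agree, which is why the isotropic transformation can be computed without the labels.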



The transformation $T$ that puts $\hat F$ in isotropic position is only a function of $E_{x \leftarrow \hat F}[x]$ and $E_{x \leftarrow \hat F}[xx^T]$, and these quantities are computable without the labels $\ell_i$. So this implies

Theorem 56. Given an $\epsilon'$-statistically learnable distribution (for $\epsilon' > \epsilon$) $F$, given m = O ( " Is^ ) samples from $F$, one can compute a transformation $T$ such that there is an $\epsilon' - O(\epsilon)$-statistically learnable distribution $\hat F$ s.t. with probability at least $1 - \delta$

• computing a $\gamma$-close estimate for $\hat F$ is also a $\gamma + O(\epsilon)$-close statistical estimate for $F$

• the transformation $T$ places $\hat F$ in exactly isotropic position

• $T$ can be computed from just the sample points $x_1, x_2, \ldots, x_m$

• $D(F, \hat F) \le O(\epsilon)$
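
One way to compute a transformation from the sample points alone is sketched below. The text does not say how $T$ is computed, so this is an assumption: it takes isotropic position to mean zero mean and identity covariance, and whitens with a Cholesky factor $L$ of the empirical covariance, mapping $x \mapsto L^{-1}(x - \hat\mu)$. Other choices, such as the inverse symmetric square root, would serve equally well. All function names are hypothetical.

```python
import math

def mean_and_cov(X):
    """Empirical mean vector and covariance matrix (normalized by m)."""
    m, n = len(X), len(X[0])
    mu = [sum(x[a] for x in X) / m for a in range(n)]
    cov = [[sum((x[a] - mu[a]) * (x[b] - mu[b]) for x in X) / m
            for b in range(n)] for a in range(n)]
    return mu, cov

def cholesky(C):
    """Lower-triangular L with L L^T = C (C must be positive definite)."""
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = C[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def whiten(X):
    """Apply x -> L^{-1}(x - mean) to every sample by forward substitution."""
    mu, cov = mean_and_cov(X)
    L = cholesky(cov)
    out = []
    for x in X:
        c = [xi - mi for xi, mi in zip(x, mu)]
        y = []
        for i in range(len(c)):
            y.append((c[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
        out.append(y)
    return out

# Illustrative samples only.
X = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [2.0, 4.0], [1.0, 3.0]]
Y = whiten(X)
mu_Y, cov_Y = mean_and_cov(Y)
```

After the transformation the samples have empirical mean zero and empirical covariance equal to the identity, up to floating-point error.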
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J Basic Properties of Gaussians 



In this section we state many useful basic facts about univariate Gaussian distributions that are used 
throughout this paper. 

Definition 25. Given a discrete distribution on points in 1-dimension, A, we will define var{A) to be the 
variance of this distribution. 

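Fact 57. Given a GMM of 1-dimensional Gaussians, $F = \sum_i w_i N(\mu_i, \sigma_i^2)$,

$$\mathrm{var}(F) = \mathrm{var}(A) + \sum_i w_i\sigma_i^2$$

where $A$ is the discrete distribution on points in 1-dimension corresponding to selecting each $\mu_i$ with probability $w_i$.

Proof.

$$E_{x \leftarrow F}[x^2] = \sum_i w_i E_{x \leftarrow F_i}[x^2] = \sum_i w_i(\mu_i^2 + \sigma_i^2) = E_{x \leftarrow A}[x^2] + \sum_i w_i\sigma_i^2.$$

Also $E_{x \leftarrow F}[x] = \sum_i w_i\mu_i = E_{x \leftarrow A}[x]$ and combining these equations yields:

$$\mathrm{var}(F) = E_{x \leftarrow F}[x^2] - (E_{x \leftarrow F}[x])^2 = E_{x \leftarrow A}[x^2] - (E_{x \leftarrow A}[x])^2 + \sum_i w_i\sigma_i^2 = \mathrm{var}(A) + \sum_i w_i\sigma_i^2.$$

□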
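
The decomposition in Fact 57 can be confirmed with exact rational arithmetic. The sketch below computes the mixture variance directly from its first two moments and compares it with the variance of the discrete distribution on the means plus the weighted average of the component variances. The particular weights, means and variances are arbitrary.

```python
from fractions import Fraction as Fr

def mixture_variance(weights, means, variances):
    """Variance of a 1-D Gaussian mixture from its first two raw moments."""
    first = sum(w * m for w, m in zip(weights, means))
    second = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances))
    return second - first * first

def discrete_variance(weights, points):
    """Variance of the discrete distribution taking points[i] w.p. weights[i]."""
    first = sum(w * p for w, p in zip(weights, points))
    second = sum(w * p * p for w, p in zip(weights, points))
    return second - first * first

weights = [Fr(1, 2), Fr(1, 3), Fr(1, 6)]
means = [Fr(0), Fr(3), Fr(-6)]
variances = [Fr(1), Fr(4), Fr(1, 2)]

lhs = mixture_variance(weights, means, variances)
rhs = discrete_variance(weights, means) + sum(
    w * v for w, v in zip(weights, variances))
```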

Fact 58. Let $F_1 = N(\mu_1, \Sigma_1)$ and $F_2 = N(\mu_2, \Sigma_2)$ be two $n$-dimensional Gaussian distributions. Let $\lambda_1, \ldots, \lambda_n > 0$ be the eigenvalues of $\Sigma_1^{-1}\Sigma_2$. Then the variation distance between them satisfies,

$$(D(F_1, F_2))^2 \le \sum_{i=1}^{n}\left(\lambda_i + \frac{1}{\lambda_i} - 2\right) + (\mu_1 - \mu_2)^T\Sigma_1^{-1}(\mu_1 - \mu_2).$$

Fact 59.

$$\max_{\sigma^2} N(0, \sigma^2, \gamma) = \frac{1}{\gamma\sqrt{2\pi e}}$$

Proof. It is easy to verify that $\mathrm{argmax}_{\sigma^2} N(0, \sigma^2, \gamma) = \gamma^2$, from which the fact follows. □
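
A numerical check of the maximum of the Gaussian density at a fixed point $\gamma$ over all variances: the sketch below scans a grid of variances and compares the best value with the closed form $1/(\gamma\sqrt{2\pi e})$, which is attained at $\sigma^2 = \gamma^2$. The grid range and resolution are arbitrary choices.

```python
import math

def gaussian_pdf(mu: float, var: float, x: float) -> float:
    """Density of N(mu, var) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def max_density_at(gamma: float, grid_points: int = 20000) -> float:
    """Largest value of N(0, var, gamma) over a grid of variances in (0, 4*gamma^2]."""
    best = 0.0
    for s in range(1, grid_points + 1):
        var = (4.0 * gamma * gamma) * s / grid_points
        best = max(best, gaussian_pdf(0.0, var, gamma))
    return best

gamma = 0.5
closed_form = 1.0 / (gamma * math.sqrt(2.0 * math.pi * math.e))
grid_max = max_density_at(gamma)
at_argmax = gaussian_pdf(0.0, gamma * gamma, gamma)
```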
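Corollary 60.

$$\max_{\mu, \sigma^2 :\, \mu + \sigma^2 \ge \gamma} N(\mu, \sigma^2, 0) \le \max\left(\frac{2}{\gamma\sqrt{2\pi e}}, \frac{1}{\sqrt{\pi\gamma}}\right)$$

Proof. Either $\mu \ge \gamma/2$, or $\sigma^2 \ge \gamma/2$. In the first case, using Fact 59,

$$\max_{\mu \ge \gamma/2} N(\mu, \sigma^2, 0) = \max_{\sigma^2} N(0, \sigma^2, \gamma/2) = \frac{2}{\gamma\sqrt{2\pi e}}.$$

In the second case, we have

$$\max_{\sigma^2 \ge \gamma/2,\, x} N(0, \sigma^2, x) = N(0, \gamma/2, 0) = \frac{1}{\sqrt{\pi\gamma}}.$$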



Lemma 61. [Lemma 29 from 11 7/ / Given < 2, 

/ 1^1^(0, cr^,x)dx < O (e-'e-s 

J\x\>l/e ^ 

Corollary 62. 

\x\W{fi,a'^,x)dx < O (max(|/i|, -)'e"^ 



□ 



Proof. Using Lemma |61[ the above bound follows by a change of variables and induction. Note that the 
constant inside the 0() depends (exponentially) on i. □ 

Lemma 63. Given fi, a^, cr'^ such that |^'| < c and e^/'^ < a^,a'^ < 2 and — + — a'^\ < e, 
(and we also assume that ec^ = o(l) and c> 1) then 

I / xWifJ,,a\x)dx- I a:W(/i',(T'^a;)da;| < 0(c*+2ei/s + c'e-^) 



Proof. Consider the interval $I = [-2c, 2c]$. Then in order to bound $\max(|N(\mu, \sigma^2, x) - N(\mu', \sigma'^2, x)|)$ over $I$, we first bound $\max(|N(\mu, \sigma^2, x) - N(\mu', \sigma^2, x)|)$ over $I$ and next we bound $\max(|N(\mu', \sigma^2, x) - N(\mu', \sigma'^2, x)|)$ over $I$.



Claim 64. 



max(|AA(M', <J^,x) - ^f{^l', a'^,x)\) = O^c^t^l"^) 



Proof. We prove this claim in two parts: first we bound max^g/ |A/'(/i, cr^, x) — Af{p' , cr^, x)|: 
max|A^(^, cr'^.a;) -7V^(^',(T^,x)| = e |l-e ^^^^ 



TTfT 



2 



1 , -2x(m'-Ai) + (h'-m')^ 



Next, we bound the term ma,Xxei{\Af{fJ.',a'^,x) — J\f {fi' , a''^ , x)\) . We accomplish this by bounding both 
max3;g / (A/'(/x', cr^, x) — Af {^i' , a'"^ , x) and max2;g/(A/'(^', cr'^, x) ~ Af{^' ,<7^ , x)). Assume that > cr'^. Then 
it follows that: m.ayi{Af{^i' ,a''^,x) - 7V(^', cr^, x)) = N {^i' , a'"^ , ^i' ) - 7V(^', cr^, = tI;;:!^^ - 5:] because 

J\f{fi',a''^,x) decreases at a faster rate than Af{fi' ,a^,x) whenever A/'(^', cr'^, x) > J\f{n',(T^,x). Also using 
the restriction that a''^,a^ > e^/^ yields - ^] < ^O(^) < 0(\/e). 
Lastly, we bound the term maxj;g/ A/'(/i', cr^ , x) — , a'^ , x): 



e 



2CT2-2e 



v27rcr''^ V 27r cr'^ 

" V27ra^^ 

< 0(^) = 0(cV/6) 

Thus these bounds imply that maxa;g/(|7V(/x', cr^, x) - 7V(/z', ct'^, x)|) = 0(c^e^/^) □ 
So we can use the Claim [64l to conclude that 



1/ x*A/'(Ai,cr^x)dx- / x''A/'(Ai^^7'^x)dx| < / |x|^|AA(//,^7^x)-A/'(Ai',a'^x)|dx = 0(c*+2el/6) 

Jig/ Ja:e/ J xel 

And we can use Corollary [62] to conclude that 
I / xW{fi,a^,x)dx- [ xW(A^',c^'^x)dx| < I / xWifi,a'^,x)dx\ + \ [ xW(^', cr'^, x)dx| < 0(c^e-^ ) 

Ja;^/ ./a;6^/ ^a;^/ Jx^/ 



□ 

Claim 65. The $k^{th}$ raw moment of a univariate Gaussian, $M_k(N(\mu, \sigma^2)) = \sum_{i=0}^{k} c_i\mu^i\sigma^{k-i}$, where $|c_i| \le (k + 2)!$.
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Proof. Consider the moment generating funetion Mx{t) = e'^+°' * We claim that '^jt^''*^ = polyidj-, cr, t)- 
Mx{t), where polyi{jj,,a,t) is a polynomial of ^,a'^,t, whose degree when viewed as a polynomial over t is 
at most i, whose degree when viewed as a polynomial over /i, ct^ is at most j, and whose coefficients are 
bounded in magnitude by il. We prove this by induction, with the base case i — 1 being trivial. Assuming 
the statement holds for some value i > 1, we have 

d}M-X{t) dMx{t) dpolyi{n,a,t) 

= poly,{^,a,t).^^ + Mxit) 

= {poly,i,, a, mah + ,) + 'P'^'^f ) MxH) 

Thus polyi+i{fj,,a,t) = polyi{n,a,t){2aH + /i) + ^^^iLMjAih^^ Clearly degt{polyt+i{ii,(J,t)) = i + and the 
degree in terms of /z and cr^ increases by at most one. To get from polyi to polyi+i , each coefficient is multiplied 
by 2 in the first product, and multiplied by at most i in the second term because of the differentiation. Thus 
if c is the maximum magnitude of a coefficient oi polyi, the maximum magnitude of a coefficient of polyi+i 
will be at most (2 + i)c, from which the claim follows. □ 
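
The shape of the raw moments in Claim 65 can be checked by computing them symbolically. The sketch below does not follow the moment generating function argument above. It uses the standard recurrence $M_k = \mu M_{k-1} + (k-1)\sigma^2 M_{k-2}$ with $M_0 = 1$ and $M_1 = \mu$, and stores each moment as a map from exponent pairs $(i, j)$ of $\mu^i\sigma^j$ to integer coefficients. The function name is hypothetical.

```python
from math import factorial

def raw_moment_poly(k: int) -> dict:
    """k-th raw moment of N(mu, sigma^2) as {(i, j): c} meaning c * mu^i * sigma^j."""
    prev2 = {(0, 0): 1}          # M_0 = 1
    if k == 0:
        return prev2
    prev1 = {(1, 0): 1}          # M_1 = mu
    for r in range(2, k + 1):
        cur = {}
        # mu * M_{r-1}: raise the exponent of mu by one.
        for (i, j), c in prev1.items():
            cur[(i + 1, j)] = cur.get((i + 1, j), 0) + c
        # (r - 1) * sigma^2 * M_{r-2}: raise the exponent of sigma by two.
        for (i, j), c in prev2.items():
            cur[(i, j + 2)] = cur.get((i, j + 2), 0) + (r - 1) * c
        prev2, prev1 = prev1, cur
    return prev1

fourth = raw_moment_poly(4)   # mu^4 + 6 mu^2 sigma^2 + 3 sigma^4
```

Every monomial produced this way has total degree $k$ in $(\mu, \sigma)$, and the coefficients stay well below $(k + 2)!$ for the small values of $k$ tried here.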
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